Reactome Curator Guide

From ReactomeWiki

A Guide for Curators and the Pathologically Curious

Put it into practice at http://www.reactome.org

Introduction

This document provides basic guidelines to the curator as to how data should be entered into Reactome via the Curator Tool. The goal of Reactome annotation is to take a given biological process or a "topic" in biology, and represent it as a network of reactions. The process of converting a biological topic should maintain the referential integrity of the reactions, both newly added as well as those present in Reactome already. It is assumed that you (the reader of this guide) have some familiarity with Reactome, and have read the Reactome papers. They are freely available as PDF files.

Though this document is oriented toward the Reactome curator, there is much here for outside groups using Reactome. The entire dataset, website, curator, and author tools can be downloaded from here. Installation and configuration instructions are also provided here. If you have any problems with the install or tools, please contact us at help@reactome.org.

As a Reactome curator you have access to the Reactome cvs repository that is used to manage the documents, files, and programs that the Reactome group is collectively working on. The repository is accessible from your account on one of the lab servers. CVS is the Concurrent Versions System, an open-source version control system. For a convenient summary of cvs commands and a comprehensive discussion, see Chapter 14, "CVS and RCS", of "Linux in a Nutshell", current edition (O'Reilly [1]). The repository is named GKB (GKB is the acronym for Genome KnowledgeBase, an early version of Reactome). A number of useful documents can be found within the GKB/docs directory and are further organized within the following subdirectories:

/Connection_builder - connection builder related promo stuff
/GK_docs - old documents, of possible historical interest
/Promotional - for poster, workshop, demo material etc.
/Reactome_manuals - author, curator, reviewer guides
/SOP - standard operating procedures - release notes, slice checking, DB rollback

Now with the paperwork out of the way, onto the structure of the Reactome data model…

The main concepts in the Reactome data model are Event and PhysicalEntity. Events are of two types: Pathways or Reactions. Pathways in Reactome are multi-step events, whereas Reactions are single-step events (at the molecular or atomic level). Events may be linked to other Events that precede them, regulate or are regulated by them. Reactions contain PhysicalEntities that take part in the Event. Topics seen on the Reactome front page are broad biological concepts that usually encompass other Pathways as components.

Figure 1 – The Reactome data model robustly represents biology. The central concept in Reactome is the reaction, which is used together with pathways, macromolecules, small molecules, complexes, and catalyst activities to represent biological processes. The reaction itself is a single step biological event in which input entities are converted to output entities.

Events are the conversion of input PhysicalEntities to output PhysicalEntities. These are the building blocks used in Reactome to represent all biological processes. At present, only two subclasses of Event are recognized, Reaction and Pathway. A Reaction is an event that converts inputs to outputs in a single step. A pathway is any grouping of related events. An event may be a member of more than one pathway.

The Pathway class is thus remarkably heterogeneous at present. Work is underway to extend this aspect of the Reactome data model to support distinctions between conventional pathways, e.g., "fatty acyl CoA biosynthesis", and other useful groupings of events, e.g., "carbohydrate metabolism", "hydroxylation of xenobiotics", or "Cell cycle progression".

PhysicalEntities can be single entities, such as proteins, small molecules, RNA, DNA, carbohydrates, lipids, or sub-atomic particles. They can also be complexes consisting of a combination of any of the single entities, or polymers synthesized from the single entities. Related entities can be grouped into a set. PhysicalEntities can be the inputs, outputs, catalysts, regulators, or requirements in Reactions.
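The Event and PhysicalEntity concepts described above can be sketched as a few small classes. This is only an illustration of the relationships in the text, not the actual Reactome schema: the class layout, the attribute names (preceding_events, inputs, outputs, has_event) and the helper reactions_in are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhysicalEntity:
    """A participant in reactions (protein, small molecule, complex, ...)."""
    name: str


@dataclass
class Event:
    """Base class for anything that happens: a Reaction or a Pathway."""
    name: str
    preceding_events: List["Event"] = field(default_factory=list)


@dataclass
class Reaction(Event):
    """A single-step event that converts inputs to outputs."""
    inputs: List[PhysicalEntity] = field(default_factory=list)
    outputs: List[PhysicalEntity] = field(default_factory=list)


@dataclass
class Pathway(Event):
    """A grouping of related events (reactions or other pathways)."""
    has_event: List[Event] = field(default_factory=list)


def reactions_in(event: Event) -> List[Reaction]:
    """Collect every Reaction reachable from an event, depth first."""
    if isinstance(event, Reaction):
        return [event]
    if isinstance(event, Pathway):
        found: List[Reaction] = []
        for child in event.has_event:
            found.extend(reactions_in(child))
        return found
    return []


# Hypothetical entities and events, used only to exercise the classes.
a, b = PhysicalEntity("A"), PhysicalEntity("B")
forward = Reaction(name="A to B", inputs=[a], outputs=[b])
reverse = Reaction(name="B to A", inputs=[b], outputs=[a],
                   preceding_events=[forward])
inner = Pathway(name="inner", has_event=[forward, reverse])
# The same reaction may be a member of more than one pathway.
outer = Pathway(name="outer", has_event=[inner, forward])
```

Calling reactions_in(outer) walks the nested pathway and returns the forward reaction twice, once for each pathway that contains it.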

Other annotations in Reactome are linked to Events or PhysicalEntities, or to properties of these. For example, a literatureReference is linked to an Event, and the authors of this paper are linked to the literatureReference as instances of another class: Person. A Summation is a text description linked to an Event, and a literatureReference may support this Summation. In the latter case, the literatureReference has an indirect link to the Event itself. These relationships will be evident as you encounter pathways and reactions in the Curator Tool.

For a more detailed description of the Reactome data model, see the attached data model glossary.

Downloading, installing, and maintaining the curator tool

The curator tool is a separate program that you need to install on your own computer, and that you run locally to create new Reactome instances and edit existing ones. Features of the program allow you to retrieve existing instances from the central Reactome database, create new instances of physical entities and events, perform QA checks on instances, and enter finished material into the central database. For all Reactome downloads go here. (The download page also has a link to a development version of the Java web-start curator tool which should not be used for routine curation yet.) You may also choose the appropriate version of the curator tool from here:

Platform   Download Page
Windows    Available
MacOSX     Available
Linux      Available

The curator tool requires a current install of the Java Virtual Machine on your computer. Current machines should have the Java Virtual Machine already installed. If you are using an older machine that lacks it, please go here to find the latest Java Runtime Environment (JRE) Update for your platform (a free download). If this fails, contact Guanming Wu for help.

The curator tool is periodically upgraded in two ways: the program itself is revised to fix bugs or add new features, and the schema is revised. The schema is a file, installed along with the curator tool on your computer, that the tool uses to define the attributes of the various kinds of instances and the relationships among them. Both kinds of upgrades are generally discussed in advance and announced on the Reactome-dev mailing list. Also, when a new version of the tool is available, you will be prompted to download it when you start the curator tool. The download is generally very fast. It may require exiting and re-starting the curator tool, but existing curator tool files should be editable with the new version of the tool.

Figure 2 – Select "update schema from DB" from the Tools menu

To update your schema file, choose the "tools" item on the top menu bar, and choose the "update schema from DB" on the drop-down menu, Figure 2. Updating is quick and automatic, and the changes take effect without a re-start of the curator tool. However, some schema changes can cause existing curator tool project files to become incompatible with the central database. Be sure that you understand the implications of a schema change for your projects when it is discussed by the software group!

Author Input

The most desirable form of expert input is a project created using the Reactome Author Tool. An Author Tool project file (with a .gkb file extension) can be imported into the Curator Tool and manipulated by a curator. However, authors may provide input as PowerPoint slides, text summations, email, or cryptic handwritten notes. In whatever form the curator receives expert input, it is the curator's job to migrate the data into Reactome. Generally this is a stepwise procedure that may require the curator to pre-assemble the Pathways, Reactions, and PhysicalEntities that describe a biological pathway before meeting with the expert. Subsequent refinements will allow the expert and the curator to transmute the biological pathway into a logical Reactome representation.

Assisting the Expert Author with the AuthorTool

The expert author will often be new to the AuthorTool, so it is a good idea to do as much as possible ahead of time to assist the author. Remember that during a jamboree or initial meeting the author is learning the structure of the Reactome database, how to use a new piece of software, and how to fit their expertise into these new concepts. A number of steps can be taken ahead of time to ease this transition. The author needs a solid introduction to Reactome, and ideally this introduction should be made using a simple pathway populated with some of the reactions and entities the author will be working with. Much of the initial meeting will be spent casting the expert's idea of the biological pathways into the Reactome data model. Do not get disappointed: any ground gained in a face-to-face meeting that familiarizes the expert with the steps you will take with their expertise will save much time later.

General Classification of Events and Entities

When starting a new module (a set of reactions and pathways that make up a biological pathway) the curator must make decisions imposing a logical topography on the pathways and reactions within the module. This first step often seems somewhat arbitrary, but it is the establishment of an approximate scaffold that reflects the current state of what is known in the field that is important. Use the following standard procedures to guide your decisions in structuring the topography.

Choosing between Pathway and Reaction

A reaction event is the basic unit of Reactome – an event in which something happens to input physical entities, converting them to output entities in a single step, possibly involving other physical entities as catalysts or regulators. A pathway event is a grouping of two or more other events which can themselves be reactions or pathways.


Correspondence of Reactome classes to GO

The Gene Ontology (GO) consortium provides three controlled vocabularies, "Biological Process", "Molecular Function" and "Cellular Component". These terms allow users to identify equivalent data objects between independent model organism databases and integrative knowledgebases like Reactome. Reactome uses all three GO ontologies: PhysicalEntities are assigned locations from "Cellular Component", the Activities of proteins and complexes are taken from "Molecular Function", and Pathways are assigned terms from "Biological Process". Each GO term used in Reactome is hyperlinked to its definition on the GO consortium website. If a GO term exists in the GO hierarchies but is not present in gk_central, you can create an instance of the appropriate class (GO_Biological Process, GO_Molecular Function, or GO_Cellular Component), as long as you make sure the accession and reference database are correctly entered and the correct GO class is used, AND you inform the editorial release manager so that it can be checked after the GO update.

Curator Tool Operation and Layout

The primary tool of curation is the Curator Tool, a Java-based tool for use by curators to annotate biological pathways based on the Reactome schema. Though it may initially seem daunting, there are actually just a few steps that will get you up and going relatively quickly. It is important to note that if you are reading this as an outside curatorial group that has set up your own local version of the Reactome dataset, some of the connection specifics may be different.

To get a feel for how the Reactome Curator Tool works, download a copy from the Reactome site and immediately go to the help menu. Read through the help section for a basic introduction to the available views of the data (Hierarchical vs. Schema based) and the basic operation of the tool. A curator tool tutorial is under development; in the meantime, direct any questions to help@reactome.org.

Creating a new instance

To create a new instance of a given class, click to highlight the class, and then either right click the mouse or use the create new instance button on the toolbar on top. This opens a dialog box that allows a curator to annotate this new instance. A new instance can also be created within the dialog for adding a property.

Adding, editing or removing a property

To add, edit or remove a property of an instance, right click the selected instance. A property can also be edited by double-clicking it; this opens a dialog box that allows the curator to check the details, and edit as appropriate.

Cloning an instance

A curator may want to create an Event or a Complex that shares many of the attributes of an existing instance of the same class. To do this, a curator can right click on the existing instance and choose the clone instance option from the drop-down menu. Once it is cloned, the curator should be particularly careful to operate on the "cloned" instance, denoted by its negative ("-") DB_ID.

Database Browser

The database browser allows a curator to browse the central repository, and check out instances into their local repositories. In the Schema view the whole Schema structure is displayed, and events can be checked out as individual instances. Referrers of the selected instance/s are checked out only as shell instances or links, i.e. all the properties of a referrer instance are not checked out into the local repository.

The browser is interacting with the MySQL database, so the underscore character _ is a wildcard. To search for a name that contains an underscore character, precede it with a backslash: \_. For example, a search for summations whose text contains _ returns all the summation instances in gk_central; a search for ones whose text contains \_ returns the two summation statements whose texts in fact include an underscore character.
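The wildcard behaviour described above can be reproduced in a few lines. This is a rough sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the MySQL server (SQLite needs an explicit ESCAPE clause, whereas MySQL treats the backslash as the escape character by default), and the summation table, its text column, and the helper names are hypothetical.

```python
import sqlite3


def escape_like(term: str) -> str:
    """Backslash-escape the LIKE wildcards so they match literally."""
    return (
        term.replace("\\", "\\\\")
        .replace("_", "\\_")
        .replace("%", "\\%")
    )


def search_text(conn, term, literal):
    """Return texts containing term, sorted; escape wildcards if literal."""
    pattern = "%" + (escape_like(term) if literal else term) + "%"
    rows = conn.execute(
        "SELECT text FROM summation WHERE text LIKE ? ESCAPE '\\' ORDER BY text",
        (pattern,),
    )
    return [row[0] for row in rows]


# A tiny in-memory table standing in for the summation instances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summation (text TEXT)")
conn.executemany(
    "INSERT INTO summation VALUES (?)",
    [("ATP binds",), ("gk_central note",), ("x",)],
)

everything = search_text(conn, "_", literal=False)  # "_" matches any character
only_underscore = search_text(conn, "_", literal=True)  # "\_" matches "_" only
```

With the unescaped pattern every row is returned, because each text contains at least one character; with the escaped pattern only the row that really contains an underscore comes back.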

The Schema view is also exceptionally useful for searching Reactome. When preparing a new module, time spent up front searching for PhysicalEntities and Reactions that already exist within Reactome can save a great deal of time later when weaving the completed module back into the reaction web.

In the Event view, the Event tree is displayed. Checking out an Event using this utility checks out the Event itself and all its referrers into the local repository.

Beginning to Curate

When you begin a new project, inform the curation coordinator, so that s/he can add your project to the frontpage if appropriate, and also keep track of projects going into the central repository.

Beginning a new Pathway project

To create a new project using the Curator Tool, a curator can use the database browser to check out entities or reactions to connect the new project to existing events in the central data repository. The curator can create Pathways and Reactions, and connect them using the "preceding event" slot on events, and annotate more details as described below in the Curation section (Section 6).

Affiliation

By default, all projects in Reactome describe human biology and, when finished, will be displayed as part of the main Reactome web site. If your project is one of these, you can skip this section.

If, however, your pathway is part of a model organism project like Gallus Reactome, you need to identify it as such using the optional 'project' slot value of your person instance. Create a person instance for yourself with your first name, surname, first initial, and affiliation as usual. Also fill in the 'project' slot with an appropriate name (e.g., FlyBase, Gallus Reactome). If multiple curators are all contributing to a single model organism pathway project, they must all use the same project name. If you are working on projects for more than one species, you should have a separate person instance for each. These person instances would differ only in their 'project' slot values. Again, no 'project' slot value means that the project created by that person is part of human Reactome.

This slot value is used to assign ownership of proteins, complexes, and events to projects - FlyBase, Gallus Reactome, etc., and to generate project-specific slices for release, even as all of the events and entities co-exist in gk_central. Any event or entity not explicitly claimed in this way belongs by default to human Reactome.
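The ownership rule above (an empty 'project' slot means human Reactome) can be sketched as follows. The Person fields mirror the slots named in the text, but the class itself, the helper functions, and the example people and instance names are hypothetical and are not part of the Curator Tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DEFAULT_PROJECT = "human Reactome"


@dataclass(frozen=True)
class Person:
    first_name: str
    surname: str
    initial: str
    affiliation: str
    project: Optional[str] = None  # empty slot: belongs to human Reactome


def owning_project(creator: Person) -> str:
    """Return the project that owns anything created by this person."""
    return creator.project or DEFAULT_PROJECT


def slice_for(project: str, instances: List[Tuple[str, Person]]) -> List[str]:
    """Pick the names of the instances that belong to one project."""
    return [name for name, creator in instances
            if owning_project(creator) == project]


# One curator working on two species needs two person instances that
# differ only in their 'project' slot value.
human_curator = Person("Ada", "Example", "A", "Example Institute")
gallus_curator = Person("Ada", "Example", "A", "Example Institute",
                        project="Gallus Reactome")

instances = [
    ("human event", human_curator),
    ("chicken event", gallus_curator),
]
```

Here slice_for("Gallus Reactome", instances) selects only the chicken event, while the event created through the person instance with no project value falls into the human Reactome slice.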

Editing an existing Pathway in the central data repository

To edit an existing pathway in the central data repository, use the Database Browser to find and check out the Pathway, or a set of individual instances depending on the level of editing required.

Editing an Author Tool project

To edit an existing Author Tool project, import the .gkb file created using the author tool. This should give the curator a basic framework for a given pathway, and hopefully good text descriptions of events with literature references that provide evidence for the events. A number of attributes may be missing, and can only be added to a pathway using the curator tool. The top-level pathway (container pathway), may be checked in directly into the central data repository, or may need to be integrated into an existing pathway in the central data repository.

The function of the Author Tool is for a biology expert to share their insight into the biology of a given topic. Their input need not conform to the Reactome data model. It is the curator's responsibility to ensure that the data conform to the Reactome data model. It is important that you pay attention to the species that the author has used for the PhysicalEntities, Pathways, Reactions, and LiteratureReferences. Reactome does not mix species data, and the curator should keep a weather eye on the integrity of the data within reactions. For example, the author may have ascribed sequential events to different species, based on the experimental evidence available. The curator needs to make sure that this is sorted out, and that any non-human reactions are only used for deducing human reactions in a given "human" pathway, and not linked to reactions in the human pathway. This of course is not true when dealing with host-pathogen interactions.

Curation

This section describes the curation process, which involves fitting biological data into the Reactome data model. Many aspects of the model are described here, but alternative, sometimes more complete, descriptions are also available in the form of a glossary.

Creating text

At many points in the process, you will be creating text, such as names of physical entities and reactions, as well as summation statements. Simple ASCII text gives the most reliable results. In particular,

1. Avoid HTML mark-up. There are many places where our web code cannot handle it. The only place where HTML tags are known to work reliably is within the text of summation instances, and even there only four tags, those for a hard return, a new paragraph, italic font, and bold font, are known to work reliably. All other tagging is strongly discouraged (except that additional tags you've already put into material that has been successfully released and that you know displays properly on several browsers do not need to be removed).
2. Avoid non-ASCII characters. They are supposed to be handled properly by our web code and by current generation web browsers, but this is not reliably true and our public site continues to display a small proportion of funny characters instead of the intended diacritical marks, inverted quotation marks and the like.
3. Avoid copy-paste operations from files generated by programs like Word. Such text is a rich source of hidden mark-up tags and funny characters. Composing text in Word is often really useful, however. To sanitize such text, copy it into a simple text editor (Windows Notepad, Mac Simple Text, etc.), then copy it from there into the curator tool. That does a really good job of stripping away the unwanted stuff and leaves text that behaves in a predictable way.
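The three rules above lend themselves to a quick check before text is pasted into the curator tool. The following is a rough heuristic only, not something the Curator Tool provides: the function name is hypothetical, and the tag pattern is deliberately crude, so it will also flag harmless text such as "a < b > c".

```python
import re

# Anything that looks like <...> is treated as a possible mark-up tag.
TAG = re.compile(r"<[^<>]+>")


def text_problems(text: str):
    """List suspected mark-up tags and non-ASCII characters in text."""
    problems = []
    for match in TAG.finditer(text):
        problems.append(
            f"markup tag {match.group(0)!r} at position {match.start()}"
        )
    for position, char in enumerate(text):
        if ord(char) > 127:
            problems.append(
                f"non-ASCII character {char!r} at position {position}"
            )
    return problems


clean = text_problems("Simple ASCII text gives the most reliable results.")
tagged = text_problems("<font>name</font>")
curly = text_problems("\u201cquoted\u201d")  # curly quotation marks, as pasted from Word
```

An empty list means the text passed both checks; anything else is worth a look before the text goes into a name or summation slot.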

Creating Events

Top Level Event

A top-level event for a given topic will almost always be a conceptual pathway that is then populated with Pathway events. Break down the top-level pathway into its components to create tiered events. For example, if DNA Replication is the top-level event, then DNA Replication pre-initiation, DNA Replication initiation, and DNA Strand elongation, are level 1 events. Continue to break down each of the three components of DNA Replication into their component events. In this example, DNA Replication is a Pathway, as are DNA Replication pre-initiation, DNA Replication initiation, and DNA Strand elongation.

Pathway

To create a new pathway, right click on Pathway in the class tree in the left panel, or choose the create instance button on the top toolbar. This pops up a dialog box for creating a new instance, and allows the curator to add properties to the Pathway. Fill in the value slots for properties as described below; note that properties that are the defining attributes appear at the top in gray, and the rest of the properties appear in alphabetical order with respect to "property name".

hasEvent

by definition a Pathway is an Event that has component Events. Right click in the value slot of hasEvent and choose "add" from the menu to add an event as a component.

DB_ID

is uneditable. Allows the curator to check this attribute for searches etc.

displayName

are both internal variables, not for curator use.

doRelease

this attribute is set to false by default. Change it to true only when a pathway is ready for release.

compartment

right click to choose the appropriate compartment.

created

right click and add an existing InstanceEdit instance to add an author, or choose an existing InstanceEdit as appropriate.

crossReference
evidenceType

used by the orthology script to denote electronic inference. Not for curator use.

Figure

add the url for the figure here as "/figures/xxx.jpg". Also, check into cvs a figure with the identical name xxx.jpg in the Reactome/website/images/ directory.

goBiologicalProcess

right click to choose proper GO cross reference.

inferredFrom
literatureReference

important attribute to be filled in. For a Pathway, this should be a comprehensive, trusted review article.

modified

uneditable. Not for use by curators. Used to track edits to the instance.

name

give the Pathway a unique name. Uniqueness is not required, but is requested for the convenience of curators (yourself and others).

orthologousEvent
precedingEvent

Either choose a Pathway or Reaction that occurs before this Event, or create new ones as appropriate.

Summation

right click to create a summation; ideally a curator would be adding the text description for a Pathway that was submitted by an author.

In all these cases, check out the appropriate instances of species, biological_processes, and cellular_components from gk_central to have them available in your local project. (These instances can be checked out at any time – it is not necessary to anticipate all needed material before starting to create and annotate new entities and events.) All data instances in Reactome have defining attributes: one or more specific features used computationally to identify and distinguish individual instances of a class from one another. The defining attributes are listed first and highlighted by gray shading in the forms generated in the curator tool. These slots should normally be filled out for all instances you create, although leaving them empty will not automatically corrupt the database.

Reaction

To create a new reaction, right click on Reaction in the class tree in the left panel, or choose the create instance button on the top toolbar. This pops up a dialog box for creating a new instance, and allows the curator to add attributes to the Reaction. Fill in the slots as described below; note that the defining attributes appear at the top in gray, and the rest of the attributes appear in alphabetical order with respect to "property name". Note that several properties are the same as those in Pathway, and are not described again here. The properties that only occur on reactions are described below.

catalystActivity

A reaction may have a CatalystActivity associated with it. Right click the catalystActivity value slot, and choose add. A new dialog box for catalystActivity appears. Either choose an instance in the list, create a new instance, or browse the gk_central DB for an already existing instance that has not yet been checked out.

input

at least one input is necessary.

output

at least one output is necessary.

precedingEvent

Usually, choose a reaction for this slot. Choose a Pathway in the rare case where the preceding event happens to be a Pathway whose components have not been, or cannot be, annotated.

The "requiredInputComponent" slot is used to indicate entities or domains (more often the latter) that are found within the input entities. This slot is used, like the "activeUnit" slot on CatalystActivity, to indicate which part/feature of the entity actually does the job, e.g. a binding domain that mediates Complex formation. It is NOT meant to describe additional entities that are required for the Reaction.
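The input, output, and precedingEvent rules listed above for Reaction slots can be expressed as a small pre-submission check. This is a minimal sketch, not one of the Curator Tool's own QA checks: the function name, its arguments, and the wording of the messages are assumptions made for illustration.

```python
def reaction_problems(name, inputs, outputs, preceding=()):
    """Check a reaction against the slot rules described in the text.

    preceding is an iterable of (event_name, event_class) pairs.
    """
    problems = []
    if not inputs:
        problems.append(f"{name}: at least one input is necessary")
    if not outputs:
        problems.append(f"{name}: at least one output is necessary")
    for event_name, event_class in preceding:
        # A Pathway is allowed here, but only in the rare case where its
        # components have not been, or cannot be, annotated.
        if event_class == "Pathway":
            problems.append(
                f"{name}: precedingEvent {event_name!r} is a Pathway; "
                "confirm that its components cannot be annotated"
            )
    return problems


# Hypothetical reactions used only to exercise the check.
ok = reaction_problems("A binds B", ["A", "B"], ["A:B"],
                       preceding=[("A is activated", "Reaction")])
missing_input = reaction_problems("orphan", [], ["B"])
```

The first call returns an empty list, while the second reports the missing input.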

Creating Physical Entities

PhysicalEntities in Reactome can be SimpleEntities or Complexes. Complexes are made up of more than one PhysicalEntity component, whereas SimpleEntities do not have components. Small molecules, sub-atomic SimpleEntities, and proteins without an accession number are ConcreteSimpleEntities. A SimpleEntity that is protein, RNA, or DNA and has an accession number assigned to it is a SequencedEntity. A fragment of a SequencedEntity is a SequencedEntityFragment. SimpleEntities without an accession number, and those that can be "one out of a set of possibilities", are SimpleEntities.

Properties on top in gray are the defining attributes.

Complex

hasComponent

add PhysicalEntities as components. Protein monomers (EntityWithAccessionedSequence), simple entities, and other complexes are all permissible components.

compartment

choose an entity compartment.

goCellularComponent

enter the GO cellular_component term for the complex here

hasDomain

add a domain here. Don't know why we need three types of domains. This question has not been answered for over 4 months.

inferredFrom

for use by the ortho script. Not for curator use.

entityOnOtherCell

enter the part of complex that is coming from another cell

summation

Could be used for a free-text summary of features of the complex, but usually left blank. Material stored here is not displayed on the Reactome website, so comments about a complex that might be useful to Reactome users should be incorporated into the summation(s) attached to the event(s) in which the complex participates.

Physical entities may only be localized to one of the non-overlapping cellular compartments listed in the entityCompartment section of gk_central.

Entity Set

The EntitySet class of entities can be a CandidateSet, a DefinedSet, or an OpenSet.

CandidateSets and DefinedSets always include a defined number of individual entities for which we want to make a statement. This doesn't mean that the statement may not be true for a greater number of individual entities, and such entities may be added in the future, but for the time being we make a statement only for the entities we choose to include, based on our present knowledge or on practical considerations.

The distinction between the CandidateSet and the DefinedSet is that the members of a DefinedSet are all known to be involved, while the CandidateSet includes entities whose participation is assumed to be true, say based on an experiment that cannot clearly distinguish between two or more proteins. So there is some evidence for their participation, but not an ultimate confirmation as to which one of the candidates. In addition to these entities, which would be entered in the hasCandidate slot, a CandidateSet may or may not include confirmed entities in its hasMember slot.

    • The members or candidates of a set MUST belong to the same class, e.g., all EWASs, all SimpleEntities, or all Complexes.

An OpenSet is a set of entities that are characterized by a common feature, given in the referenceEntity slot, but not all such entities can reasonably be counted. For instance, alcohols all have an -OH group in common, but it's impossible to include all alcohols individually in an EntitySet.

Generally if you want to form a set from a list of proteins with UniProt accession numbers, your decision will be between a CandidateSet and a DefinedSet, but not an OpenSet.
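The choice between the three kinds of EntitySet described above can be summarized as a short decision function, together with the same-class rule for members and candidates. This is a sketch of the reasoning in the text only; the function names, the two yes/no questions used as arguments, and the example accession placeholders are assumptions for illustration.

```python
from enum import Enum


class SetKind(Enum):
    DEFINED = "DefinedSet"
    CANDIDATE = "CandidateSet"
    OPEN = "OpenSet"


def choose_set_kind(members_enumerable: bool, all_confirmed: bool) -> SetKind:
    """Pick an EntitySet subclass from two yes/no questions.

    members_enumerable: can the entities reasonably be listed one by one?
    all_confirmed: is every listed entity known to be involved?
    """
    if not members_enumerable:
        return SetKind.OPEN
    return SetKind.DEFINED if all_confirmed else SetKind.CANDIDATE


def same_class(entities) -> bool:
    """True if all (name, class_name) pairs share one class."""
    return len({class_name for _, class_name in entities}) <= 1


# A list of proteins with accession numbers is always enumerable, so the
# decision is between a DefinedSet and a CandidateSet.
proteins = [("protein 1", "EntityWithAccessionedSequence"),
            ("protein 2", "EntityWithAccessionedSequence")]
mixed = proteins + [("some complex", "Complex")]
```

For example, an experiment that cannot distinguish between two proteins gives choose_set_kind(True, False), a CandidateSet, and same_class(mixed) is False because a Complex cannot share a set with proteins.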

CandidateSet

hasCandidate

add PhysicalEntities as components

hasMember

add PhysicalEntities as components

compartment

choose an entity compartment.

goCellularComponent

???

hasDomain

add a domain here. Don't know why we need three types of domains. This question has not been answered for over 4 months.

inferredFrom

for use by the ortho script. Not for curator use.

summation

Could be used for a free-text summary of features of the set, but usually left blank

Physical entities may only be localized to one of the non-overlapping cellular compartments listed in the entityCompartment section of gk_central.

DefinedSet

hasMember

add PhysicalEntities as components

compartment

choose an entity compartment.

goCellularComponent

???

hasDomain

add a domain here. Don't know why we need three types of domains. This question has not been answered for over 4 months.

inferredFrom

for use by the ortho script. Not for curator use.

summation

Could be used for a free-text summary of features of the set, but usually left blank

OpenSet

hasMember

add PhysicalEntities as components

referenceEntity

ReferenceMoleculeClass as component

compartment

choose an entity compartment.

goCellularComponent

???

hasDomain

add a domain here. Don't know why we need three types of domains. This question has not been answered for over 4 months.

inferredFrom

for use by the ortho script. Not for curator use.

summation

Could be used for a free-text summary of features of the set, but usually left blank.

EntityWithAccessionedSequence

compartment

– add an entity compartment

endCoordinate

– add

hasModifiedResidue

– add a modified residue. Used to annotate post-translational modifications of proteins.

referenceEntity

– add a referenceSequence

startCoordinate

– add

figure

– does not belong here. Do not use.

Polymer

repeatedUnit

– add PhysicalEntities as components

compartment

add a GO cellular component, entity compartment here. xxxxx explain about non-overlapping compartment set.

goCellularComponent

???

hasDomain

add a domain here. Don't know why we need three types of domains. This question has not been answered for over 4 months.

inferredFrom

for use by the ortho script. Not for curator use.

summation

Could be used for a free-text summary of features of the polymer, but usually left blank

OtherEntity

name

– add a name

compartment

add a GO cellular component, entity compartment here. xxxxx explain about non-overlapping compartment set.

goCellularComponent

???

hasDomain

add a domain here. Don't know why we need three types of domains. This question has not been answered for over 4 months.

inferredFrom

for use by the ortho script. Not for curator use.

summation

Could be used for a free-text summary of features of the entity, but usually left blank

SimpleEntity

compartment

add an entity compartment

name

add a name. Surprise, this is a defining attribute, and not ReferenceEntity

species

right click to add a species.

hasDomain

not sure this belongs here.

ReferenceEntity

Add a ReferenceMolecule. Choose one from the list or create a new instance.

summation

Could be used for a free-text summary of features of the entity, but usually left blank

Creating Modified Residues

The modifiedResidue class has been substantially reorganized and expanded to align Reactome practice with the PSI-MOD standard and to allow the annotation of variant amino acid residues in proteins in addition to chemically modified ones. The details are here.

coordinate

Enter the number of the residue that is modified. For "phosphorylation of serine 396 of Myocyte-specific enhancer factor 2C", 396 would be the value. If this number is unknown, leave the value slot empty.

referenceSequence

add the ReferenceSequence of the protein being modified

DB_ID

Do not edit - automatically filled in when the instance is submitted to the central database. This value is the unique identifier used for internal tracking of the instance - manually editing it is likely to corrupt the central database!

_displayName

Do not edit - automatically filled in by the Curator Tool.

created

Do not edit - automatically filled in when the instance is submitted to the central database. This value is used internally, with the modified slot value, to track the history of the instance.

modified

Do not edit - automatically filled in when the instance is submitted to the central database. This value is used internally, with the created slot value, to track the history of the instance.

psiMOD

Look in the psiMod class on gk_central for the instances you need, check them out into your local project and use them to fill the psiMod slots of your modifiedResidue instances. Note that a psiMod instance describes both the amino acid residue and its modification, e.g., phosphoserine. There are already psiMod instances for phosphoserine, phosphothreonine, and phosphotyrosine and about 30 others. If you don't see the particular psiMod instance you need, let Peter know.

Creating Regulations

To add a regulator or requirement for an Event, a curator must create an instance in the regulation class, and choose a catalystActivity or an Event to link it to.

Creating Summations

A Summation may be created for an Event and then linked to it, or it can be created by double-clicking on the value slot of the summation property in an Event creation dialog box.

text

add a text description of the Event. To add special effects such as italics or bold to the text, or to add paragraphs, a curator needs to use HTML tags. For example, do not put hard returns in this text block; they will appear as /n on the website. Instead use "<p>".
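As a minimal sketch of the advice above, the following Python helper turns a plain-text draft containing hard returns into a single line with "<p>" between paragraphs. The function name is hypothetical and this is not part of the Curator Tool. It assumes that blank lines in the draft mark paragraph breaks and that "<p>" is used as a separator between paragraphs.

```python
import re


def summation_text_to_html(draft: str) -> str:
    """Join a plain-text draft into one line, separating paragraphs with <p>.

    Assumption: blank lines mark paragraph breaks; single hard returns
    inside a paragraph are replaced by a space.
    """
    # Normalise Windows and old Mac line endings first.
    draft = draft.replace("\r\n", "\n").replace("\r", "\n")
    # Split on blank lines to find the paragraphs.
    paragraphs = [p for p in re.split(r"\n\s*\n", draft.strip()) if p.strip()]
    # Collapse remaining whitespace (including hard returns) to one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "<p>".join(cleaned)
```

Any HTML tags already present in the draft, such as italics, pass through unchanged; the result can be pasted into the text slot.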

Creating LiteratureReferences

pubMedIdentifier

double-click on the value slot, and type in a PMID. This should give the curator a pop-up box, asking if the information should be fetched from the PubMed website. Click yes.

Reusing instance ids:

If I modify an instance do I need to change its identifier? A crude rule of thumb is that reclassification between deep branches, i.e. when the root class is the nearest common superclass, shouldn't happen. Moves from, say, OpenSet to DefinedSet, or from OtherEntity to EWAS, are the kind that, depending on the situation, can be sensible, valid and good. Do not change the class of an instance just to "recycle" it (e.g., if you have an unnecessary reaction and you need to create a complex, don't just change the class of this instance from reaction to complex). Basically, when you edit an instance, you are saying that this instance has changed. If you delete an instance, by contrast, the meaning is that it no longer exists, even though you may have put another, even exactly identical, instance in its place.

Post-Curation Steps

After the pathways, reactions, inputs, outputs, summations, references and all of the other minutiae that are required have been entered for your module, there are a number of post-curation steps that will be required before review. Figures that you wish to be displayed for pathways and reactions must be committed, your constellation of reactions must be added to Reactome's "sky", and you must catch and correct any errors that may have worked their way into your module.

Adding Figures

Figures that illustrate individual reactions or pathways can be displayed on the Reactome web site. A number of formats are supported, including .jpg, .png, and .gif. The figure should be small enough to fit on an average screen, but there is no set standard (though we may wish to change this). Figures may come directly from the author, from a publication for which you have secured copyright permissions, or may be created by you or the art department. Once you have worked out your figure you must move it onto the Reactome development site. First use sftp to transfer your figure to the GKB/website/images directory. Then use cvs to commit your file to the repository by using your account on the Reactome server (brie8.cshl.org at the time of this writing for Reactome curators).

If this is the first time you have committed this file:

%> cvs add filename.jpg

%> cvs commit filename.jpg

If you update this file in the future you merely need to recommit the file after you overwrite the old figure:

%> cvs commit filename.jpg

Now you must change to the directory that the Reactome development site (DEV) is served from:

%> cd /usr/local/gkbdev/website/images

Then update the figures in this directory:

%> cvs update

All of the new figures that have been committed to the GKB/website/images directory will be added (or updated). Go to the DEV site to see if your figure is indeed displaying as you hoped.

Adding Reactions to the Sky

When it's time for events to be added to the sky, e-mail the Reactome astronomer, currently Peter D'Eustachio, an exact list of the events to be added (e.g., "all the component events of Pathway 123456 plus Event 143265 and Event 143627").

If any obsolete events should be removed from the sky, identify these also. (E.g., if there was already an event in the sky, perhaps created in an earlier stage of the annotation project to serve as a placeholder, which has now been divided into multiple ones, the old one should be removed and replaced.)

Someday, when we agree on a single curatorial approach to organizing events into hierarchies and the _doNotRelease tag is fully and uniformly used, it should be possible to automate part or all of these two steps. For now, it must be done manually: if you don't identify an event in an e-mail to the astronomer, it doesn't get mapped to the sky!

Remember also that the sky is not redundant: if a single-step event is in the sky, its multi-step parent cannot be. At present, conflicts like this must be resolved by mapping the child and omitting the parent. Someday, it may be possible to be more flexible about this.
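The non-redundancy rule can be checked before e-mailing the astronomer. The sketch below is a hypothetical helper, not an existing Reactome script: it assumes you have the event hierarchy as a dictionary mapping each event's DB_ID to the DB_IDs of its direct children, and it applies the rule stated above by keeping the child and omitting the parent.

```python
from typing import Dict, List, Set


def drop_parents_of_requested_events(
    requested: List[int], children: Dict[int, List[int]]
) -> List[int]:
    """Return the requested events minus any whose descendant is also requested."""

    def descendants(event: int) -> Set[int]:
        # Walk the hierarchy below `event` without recursion.
        found: Set[int] = set()
        stack = list(children.get(event, []))
        while stack:
            child = stack.pop()
            if child not in found:
                found.add(child)
                stack.extend(children.get(child, []))
        return found

    requested_set = set(requested)
    # Keep an event only if none of its descendants is also in the list.
    return [e for e in requested if not (descendants(e) & requested_set)]
```

The DB_IDs and the dictionary layout are illustrative only; the real hierarchy would have to be read from the database.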

If you have an opinion about how the events should be arranged in the sky - what other constellations they should be near or how the events in the new constellation should be organized - tell the astronomer. A sketch of the layout, or a pointer to a figure or summation that shows the layout, would be handy. This is especially important if you have a strong opinion or the correct order is not clear from annotated preceding-following relationships among the events.

Quality Assurance

All curators are expected to make mistakes at some time or another. Reactome curators are supported by a number of QA checks: some run at the level of the curator tool, others run as scripts at the time that data is checked into the database, and some run over the entire database. The QA scripts are intended to identify real mistakes in the database, not to dictate curation principles. From time to time curation principles are reevaluated, and if curation procedures change, QA scripts are adjusted and often used to ensure the consistent application of the new procedure throughout the entire dataset.

As such, QA scripts are constantly being evaluated. A QA script may identify something as a mistake that turns out not to be a problem according to the way we want to do curation. When this happens, we need to look for an algorithm or a principle to exclude this particular scenario from the list of mistakes during the next round of running the script.

Think of this scenario as a false positive: a case flagged by a script that a human user can immediately identify as okay, but that may be hard to teach the computer in a formal way. In such a situation, we may consider continuing with the script as before and ignoring these obvious false positives.

Do not worry about tripping the QA script, and do not curate differently from what should be done just to avoid it. The better way is to let us know that a mistake identified by the QA script is in fact a false positive.

Also, nobody should feel embarrassed to end up on the list when the QA scripts are run. As we all know there have been data corruptions that have been caused by something completely different, unknown to the person who created the instance in the first place. And of course, nobody can be expected to never make any mistake. So the names on the lists produced by the QA script are there to indicate the best person to have a look at it (or a potential person in case the name is preceded by a ?, when the created slot is empty), but not at all to put anyone to shame!

Similarly, work in progress is absolutely fine. You may end up on the list, but if you know it's because of work in progress there is no need to rush into action just because of the list. You may just take it as a reminder that this needs to be addressed at some point before it is released.

QA on *gk_central* is meant to inform you of problems and may help to identify problems at an early stage when you are into the subject and therefore in a better position to know the right answers. But as stated above, you may fix the problems at your own pace. (There may be some exceptions to this, e.g. mistakes interfering with a UniProt update or similar, but then I would point that out.)

QA on the *slice*, on the other hand, should be taken very seriously and addressed promptly. It seems like a good idea to take a preliminary slice a week or two before the real one is taken (as we have done this time), to allow this kind of QA to happen. This will speed up the process from the data freeze to the release itself.

Imbalance Script

This script looks at proteins in reactions, working on the assumption that proteins that go into a reaction somehow have to come out of it as well: be it in a modified form, within a complex, or in another compartment, the ReferencePeptideSequence should be present in both input and output. This script is especially helpful in checking the accuracy of events involving multisubunit complexes. An event in which a protein is synthesized or degraded can also be flagged by the script. Inform the QA manager, currently Esther Schmidt, of such instances and the script will be corrected to allow them.
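The idea behind the check can be sketched in a few lines of Python. This is not the actual QA script; the Entity class, its field names and the identifiers are assumptions made for illustration. Each entity either points directly at reference sequences or is a complex whose components are searched recursively, and the check reports reference sequences found on only one side of the reaction.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class Entity:
    """Hypothetical stand-in for a PhysicalEntity taking part in a reaction."""
    name: str
    # Identifiers of the reference sequences this entity points to directly.
    reference_ids: Set[str] = field(default_factory=set)
    # Components, if the entity is a complex.
    components: List["Entity"] = field(default_factory=list)


def collect_reference_ids(entity: Entity) -> Set[str]:
    """Gather reference sequence ids from an entity and all nested components."""
    ids = set(entity.reference_ids)
    for component in entity.components:
        ids |= collect_reference_ids(component)
    return ids


def find_imbalance(
    inputs: Iterable[Entity], outputs: Iterable[Entity]
) -> Tuple[List[str], List[str]]:
    """Return (ids seen only in input, ids seen only in output), each sorted."""
    input_ids: Set[str] = set()
    for entity in inputs:
        input_ids |= collect_reference_ids(entity)
    output_ids: Set[str] = set()
    for entity in outputs:
        output_ids |= collect_reference_ids(entity)
    return sorted(input_ids - output_ids), sorted(output_ids - input_ids)
```

Two empty lists mean the reaction is balanced. A synthesis or degradation event would show up with a non-empty list on one side, which is the kind of flagged event described above.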

Viewing instances in the instancebrowser view after implementing the ELV

http://reactome.org/cgi-bin/instancebrowser?DB=test_reactome_33&ID=109581

...where you put the DB_ID of the instance of interest after the "ID=". Set DB to the appropriate release DB.
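If you look up many instances, the URL can be assembled programmatically. The helper below is a small sketch: the function name is hypothetical, and only the base address and the DB and ID parameters are taken from the example URL above.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in this guide.
BASE_URL = "http://reactome.org/cgi-bin/instancebrowser"


def instancebrowser_url(db_id: int, db: str) -> str:
    """Build an instancebrowser URL for one instance in the given database."""
    # urlencode keeps the parameters in the order given: DB first, then ID.
    return BASE_URL + "?" + urlencode({"DB": db, "ID": db_id})


print(instancebrowser_url(109581, "test_reactome_33"))
```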

Reactome Mini-Glossary

Included here is the Reactome Mini-Glossary, or the minimum that you need to get started. What the glossary lacks in specifics is made up for in speed. As you come across more complex items ask help@reactome.org, and add to your curatorial knowledge.

Event

This is an abstract class which should not be instantiated directly. That is, in the formal hierarchical structure of the database, the event classes actually used for annotation (e.g. pathway and reaction) are children of this abstract event class. This arrangement allows shared features of all kinds of events to be maintained reliably and consistently. Programmers change the features of the event class to implement a change in the Reactome data model, but curators should never do anything to this class, nor attempt to create instances of it.

Reaction

The conversion of input entities to output entities, possibly facilitated by a catalyst or modulated by regulatory entities or events. Conversions can include binding and complex formation, biochemical reactions, transportation of entities between compartments, and signal transduction. By using an entity set as input, output, or catalyst, one can conveniently capture facts like the ability of a catalyst to convert any member of a family of small-molecule substrates to products or of a transport protein to mediate the passage of any member of a family of small molecules across a lipid bilayer membrane.

BlackBoxEvent

Events whose molecular details are not spelled out.

The hasMember and hasEvent slots are used to link the blackbox event to ones that are examples of it and to ones that are parts of it, respectively.

The BlackBoxEvent connexin synthesis annotates the transcription and translation of any member of the connexin gene family to yield the corresponding protein. Transcription and translation of three specific connexins has also been annotated to capture the specific roles of these proteins in gap junction assembly. These three events are thus specific examples of the more general process of connexin biosynthesis (is_a children, in GO terms) and the relationship is captured by making the specific events hasMember slot values of the general one.

The transformation of a small discoidal HDL (high-density lipoprotein) particle into a spherical one involves reactions in which the discoidal particle binds additional lipid molecules, bound lipids are covalently modified, and additional proteins are bound. These reactions appear not to occur in a fixed order nor with a fixed stoichiometry, so the transformation of discoidal particles into spherical ones cannot be annotated as a pathway. Instead, the specific reactions that make up this process were annotated (e.g., cholesterol + phosphatidylcholine (lecithin) => cholesterol ester + 2-lysophosphatidylcholine (lysolecithin)), a blackbox event, transformation of small nascent HDL to spherical HDL, was created, and the specific reactions were linked to the overall process by making them hasEvent slot values of the process (has_a children, in GO terms).

EquivalentEventSet (OBSOLETE)

The equivalentEventSet class was created to annotate groups of events that accomplish the same thing, e.g. a reaction of intermediary metabolism that can be catalyzed by any one of several tissue-specific isozymes. Creating a definedSet of the isozymes and using that as the physicalEntity slot value of a single catalystActivity instance linked to a single reaction instance is a better way to annotate this information, so the equivalentEventSet class has been retired.

Pathway

A set of Events which are linked by shared output/input and/or output/catalyst entities or indirectly via Regulation instances and is in some way recognized as a functional unit. I know, this is horribly vague and should be tightened since it allows pretty much everything. It would be nice to come up with a definition which would allow the Pathways to be actually useful. Anyway, the slot to be filled with components is hasComponent.

PhysicalEntity

Like Event (above), PhysicalEntity is an abstract class used to organize and manage the Reactome data structure. Curators should not modify or attempt to create instances of it.

SimpleEntity

A molecule whose exact atomic structure is known and that is not encoded in the genome (formerly, ConcreteSimpleEntity). ATP, glutathione, and ethanol are examples. Insulin and specific tRNA molecules are not, because they are directly or indirectly encoded. Alcohol and dNTP are not, because they are classes of related molecules, not single fully specified ones.

GenomeEncodedEntity

Some former GenericSimpleEntities. For things such as un-sequenced proteins which belong to a given species and hence should not be assumed to be present in other species (important for orthology inference).

EntityWithAccessionedSequence

Individual molecule with (known) sequence which is in the sequence database. Typically a protein or RNA or fragments of them. Default start and end coordinates for the entity having the full-length sequence of ReferenceSequence are 1 and -1, respectively. (-1 is a Perlism which means the last element of an array, i.e. here we mean the last residue.) For fragments, the start and end coordinates should be filled in accordingly. Use 0 if the coordinate is not known.
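The coordinate conventions can be made concrete with a short sketch. The function below is hypothetical and not part of any Reactome tool; it assumes the sequence length is known and translates the stored values into actual residue numbers, returning None where the coordinate is recorded as unknown.

```python
from typing import Optional, Tuple


def resolve_coordinates(
    start: int, end: int, sequence_length: int
) -> Tuple[Optional[int], Optional[int]]:
    """Translate stored start/end coordinates into 1-based residue numbers."""

    def resolve(value: int) -> Optional[int]:
        if value == 0:
            # 0 means the coordinate is not known.
            return None
        if value == -1:
            # -1 means the last residue of the reference sequence.
            return sequence_length
        return value

    return resolve(start), resolve(end)
```

For a full-length entity on a 473-residue sequence (an invented length), the stored pair 1 and -1 resolves to residues 1 and 473; a fragment stored as 25 and 0 resolves to a start at residue 25 with an unknown end.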

What kinds of evidence, and how much, is needed to justify creation of an EntityWithAccessionedSequence instance? In particular, the TrEMBL section of UniProt contains entries for many human proteins known only as inferences from predicted gene models: perhaps an EST corresponding to the predicted mRNA has been observed but there is no experimental confirmation that a protein is made. Many predicted isoforms due to alternative splicing similarly lack experimental confirmation. As a matter of editorial policy, a protein is considered to be “Reactome-annotatable” if the consensus of expert opinion is that the protein actually gets made in some cell. In the case of transcription factor families, for example, essentially all family members pass this test even though considerable uncertainty may still exist as to the identity of the exact family member or members responsible for regulation of a particular transcription event in a particular cell type at a particular developmental stage.

Complex

Former ConcreteComplexes and GenericComplexes with hasComponent values. Must have components. Things which are suspected or "known" to be complexes but for which the components are nevertheless unknown will not belong here.

EntitySet

An abstract superclass for DefinedSet, OpenSet and CandidateSet. Should not be instantiated directly. An entity set is a collection of structurally related PhysicalEntities (molecules or complexes) which function interchangeably in a given situation. "Structurally related" means, for example, protein isoforms that all have the same catalytic or ligand-binding activity, or nucleotide monophosphates that are all equally readily converted to the corresponding diphosphates by a particular kinase. "Function interchangeably" is defined by the limits of the experiment used to characterize the biological system to date. These collections are created as a matter of convenience to minimize curation work and to prevent combinatorial explosion. The members of any one set MUST all belong to the same class, e.g. all EWASs or all SimpleEntities or all Complexes. (This rule is editorial policy - it is not a requirement of the data model.)

DefinedSet

A group of physical entities all of which have been shown experimentally to perform the function being annotated. The group must have at least two members (two hasMember slot values).

Examples:

-Cdk 4/6 (with Cdk4 and Cdk6 as values of hasMember slot).
-NTP (with ATP, GTP, CTP and UTP as values of hasMember slot).

Comment:

If a DefinedSet of substrates (e.g., NDP (nucleotide diphosphates)) is used as input for a reaction and another defined set (e.g., NTP (nucleotide triphosphates)) is used as output, the annotation is taken to mean that the first member of the input set is converted to the first member of the output set and so on. Thus, input set (CDP, UDP) and output set (CTP, UTP) means that CDP is converted to CTP and UDP to UTP. Input (CDP, UDP) and output (UTP, CTP), however, means that CDP is converted to UTP and UDP to CTP, a very different reaction!
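The positional reading of two DefinedSets can be illustrated with a small sketch. The function is hypothetical; it simply pairs members by position, and the check that both sets have the same size is an assumption made here for the ordinary case, not a rule enforced by the data model.

```python
from typing import List, Tuple


def pair_set_members(
    input_members: List[str], output_members: List[str]
) -> List[Tuple[str, str]]:
    """Pair the n-th input member with the n-th output member."""
    if len(input_members) != len(output_members):
        # Assumption for this sketch: sets of unequal size cannot be paired.
        raise ValueError("input and output sets differ in size")
    return list(zip(input_members, output_members))


# The member order decides which conversions are implied.
print(pair_set_members(["CDP", "UDP"], ["CTP", "UTP"]))
print(pair_set_members(["CDP", "UDP"], ["UTP", "CTP"]))
```

The first call pairs CDP with CTP and UDP with UTP; the second pairs CDP with UTP and UDP with CTP, which is why the order of hasMember values matters.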

If two sets are used within the input:

Set1 + Set1 ------------> Set1:Set1 or Set1 + Set2 ------------> Set1:Set2,

it implies all the combinations are possible between the members within each set. Whenever possible use EntitySets as far down as possible (prefer to use ‘complex of sets’ rather than ‘set of complexes’).

Exception: if only certain combinations of entities do indeed form complexes, then create those individual complexes and you may create a ‘set of complexes’ if appropriate.
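To see what "all the combinations are possible" means for a complex of two sets, the sketch below enumerates the complexes implied by Set1:Set2. The function and the member names A1, A2, B1, B2 are invented for illustration.

```python
from itertools import product
from typing import List


def implied_complexes(set1: List[str], set2: List[str]) -> List[str]:
    """List every member pairing implied by a complex of two sets."""
    # product() yields each member of set1 with each member of set2.
    return [f"{a}:{b}" for a, b in product(set1, set2)]


print(implied_complexes(["A1", "A2"], ["B1", "B2"]))
```

Two sets of two members already imply four complexes. If only some of these actually form, the exception above applies and the individual complexes should be created instead.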

Be aware of stoichiometry issues when using Sets. It's easy to confuse them with complexes, but a Set only represents one entity at a time, as specified by its members.

Here is an example of a rather atypical use of Sets in a BlackBoxEvent: Classical PDGF cleavage is combinatorially complicated as it involves a set of homo- and hetero-dimers as well as various fragments. Given that not all fragments are functional we represent this process as a BlackBoxEvent, ignoring non-functional fragments. Here we also allow different numbers of set members in input vs output.

Classical unprocessed PDGF              Processed PDGF

|PDGF-A1:PDGF-A1|                       |PDGF-A’1:PDGF-A’1|
|---------------|                       |-----------------|
|PDGF-A2:PDGF-A2|                       |PDGF-A’2:PDGF-A’2|
|---------------|     ------------->    |-----------------|
|PDGF-B:PDGF-B  |                       |PDGF-Bs:PDGF-Bs  |
|---------------|                       |-----------------|
|PDGF-A2:PDGF-B |                       |PDGF-Bl:PDGF-Bl  |
                                        |-----------------|
                                        |PDGF-A’2:PDGF-Bs |

PDGF-A1: PDGF isoform-1; PDGF-A2: PDGF isoform-2; PDGF-A’1: processed PDGF isoform-1; PDGF-A’2: processed PDGF isoform-2; PDGF-Bs: processed PDGF B short form; PDGF-Bl: processed PDGF B long form

CandidateSet

A group of physical entities one or more of which is known to perform the function being annotated. This situation arises when experimental data firmly link a function to a protein family but only some (or none) of the family members have been characterized individually. The hasMember slot is used to indicate the set members definitely known to perform the given function. Other set members that are not definitively characterized are represented in the hasCandidate slot. The hasCandidate list need not be exhaustive.

Examples:

"Cyclin A1 but possibly also A2 and/or A3" would be described as
-hasMember: Cyclin A1
-hasCandidate: Cyclin A2, Cyclin A3
"A Cyclin A family member but we don't know which one" would be described as
-hasCandidate: Cyclin A1, Cyclin A2, Cyclin A3
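The two examples above can be written down as data. The class below is only a sketch of the idea, not the Reactome schema: the Python names are assumptions, with has_member and has_candidate standing in for the hasMember and hasCandidate slots.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CandidateSet:
    """Sketch of a CandidateSet with its two kinds of slot values."""
    name: str
    # Entities definitely known to perform the function.
    has_member: List[str] = field(default_factory=list)
    # Entities that may perform the function; need not be exhaustive.
    has_candidate: List[str] = field(default_factory=list)

    def possible_entities(self) -> List[str]:
        """All entities the set may stand for, confirmed members first."""
        return self.has_member + self.has_candidate


known_a1 = CandidateSet(
    "Cyclin A1 but possibly also A2 and/or A3",
    has_member=["Cyclin A1"],
    has_candidate=["Cyclin A2", "Cyclin A3"],
)
unknown_member = CandidateSet(
    "A Cyclin A family member but we don't know which one",
    has_candidate=["Cyclin A1", "Cyclin A2", "Cyclin A3"],
)
```

Both sets cover the same three cyclins; they differ only in how many are recorded as confirmed.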

OpenSet

A group of entities that is countable in principle, but not in practice, typically classes of molecules such as RNA, mRNA, lipid. Can have examples attached as values of hasMember slot. The examples given are not an exhaustive list of the entities that are represented in this set.

Example:

-poly-A-containing mRNA that is capable of circularizing, which has ceruloplasmin mRNA as a representative.

Polymer

Used to represent complexes for which the stoichiometry is unknown or not fixed, and complex molecules, such as polymers, which consist of repeated units and which we can't describe otherwise. Has to have 1 or more repeated units (in the repeatedUnit slot), which can be any PhysicalEntity. More than one repeated unit means that the units don't have to be present in equal quantities. If the units are present in equal quantities, form a complex first and use this as the repeated unit. Unit count range can be specified with the minUnitCount and maxUnitCount slots.

Examples:

-glycogen with glucose as repeatedUnit.
-fibrin multimer with fibrin "monomer" (itself a Complex) as repeatedUnit.
-A polymer consisting of equal amounts of alpha and beta tubulin would be constructed as EntityWithRepeatedUnits containing a Complex of alpha and beta tubulins in the repeatedUnit slot.
-Completely hypothetical example: A polymer consisting of 1 "part" of A and 4 "parts" of B (i.e. 1:4 ratio) would be represented as EntityWithRepeatedUnits containing a Complex of 1 x A and 4 x B in the repeatedUnit slot.
-Another hypothetical Example: a polymer where the ratio of individual building blocks A and B is unknown or variable is represented as EntityWithRepeatedUnits containing A and B directly in the repeatedUnit slot.

OtherEntity

Things that we need for annotation but which, due to limited knowledge or limited expressiveness of the data model, cannot be described precisely enough to be placed in any other class.
OtherEntity can be used to represent complex structures in the cell that take part in a reaction but which we can't/don't want to define molecularly.
Example 1: Cell membrane. In a case in which protein X associates with the membrane, but the actual membrane component(s) with which protein X interacts are unknown, the membrane can be represented as an "OtherEntity".
Example 2: kinesin-1, a microtubule motor protein, is involved in all kinds of movement in the cell, by 'walking' along microtubules while dragging things like mitochondria, secretory vesicles, parts of the Golgi, etc. Kinesins bind to these complicated structures that we would not want to describe molecularly. These structures can be created as "OtherEntities".
Example 3: Holliday structure [nucleoplasm] ([30])

Curators should look to see whether OtherEntities that suit their purposes have already been created, but they can create similar ones with descriptive names to differentiate them when necessary.