In order to provide the ISPIDER team with a means of prioritising and focussing our work, we have begun to collect some candidate use cases that involve integration of proteomic data sets. As an example, the following presents the first cases we implemented to enable Value-Added Protein Identification.
Advances in two-dimensional gel electrophoresis and mass spectrometric techniques, together with sequence database correlation, have enabled accurate identification of proteins. Using such techniques, it is possible to obtain experimental verification of predicted coding sequences and to identify novel genes. In addition, determining the function of an identified protein gives a more precise view of its profile, e.g., its molecular function, the biological process it is involved in and its classification. Accordingly, the objective of the present scenario is to specify and implement a workflow application for identifying and classifying proteins. To do so, the workflow combines the capabilities of mass spectrometric techniques with protein classification tools.
that this use case is not supported)
The scenario is a two-step process. Given a protein, it first undergoes a proteolytic digest that produces a set of peptide molecular weights. The identification process is based on comparing the peptide molecular weights determined by mass spectrometry with the theoretical masses of peptides produced in silico by digesting a set of sequences in a target database, e.g., Swissprot. The proteins obtained are then ranked according to the associated protein function and family. The remainder of this document is organized as follows. We first present the Web services involved in the scenario, then the data structures they manipulate, and finally the Taverna workflow that implements the scenario, before concluding.
Two main Web services are involved in the scenario: Pepmapper and Gene Ontology.
Pepmapper is a peptide mass fingerprinting tool that uses mass spectrometry data produced by the digestion of a protein to identify a match to a protein from a user-selected database. It takes as input the peptide masses obtained from mass spectrometry, the name of a database containing the proteins against which the peptide masses are to be compared, and an error value that specifies the degree of error that the user estimates may be allowed for the masses of the peptides obtained from the mass spectrometer. This error may be measured in parts per million (PPM), Daltons (Da) or Thomsons (Th). A larger error value decreases the specificity of the search. The execution of Pepmapper returns the set of proteins in the database that match the peptide masses.
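The role of the error parameter can be illustrated with a minimal sketch. The Python below is not Pepmapper's implementation: the function names within_tolerance and count_matches are hypothetical, the masses are made-up values, and only the PPM and Da units are covered (Th, which applies to mass-to-charge ratios, is left out for brevity).

```python
def within_tolerance(observed, theoretical, error, unit="ppm"):
    """Return True if the observed mass lies within the allowed error
    of the theoretical mass."""
    if unit == "ppm":
        # Relative tolerance: 'error' parts per million of the theoretical mass.
        allowed = theoretical * error / 1_000_000
    elif unit == "Da":
        # Absolute tolerance in Daltons.
        allowed = error
    else:
        raise ValueError(f"unsupported unit: {unit}")
    return abs(observed - theoretical) <= allowed


def count_matches(observed_masses, theoretical_masses, error, unit="ppm"):
    """Count the observed peptide masses that match at least one
    theoretical peptide mass of a candidate protein."""
    return sum(
        any(within_tolerance(obs, theo, error, unit) for theo in theoretical_masses)
        for obs in observed_masses
    )


# Illustrative values only: a tight tolerance matches one peptide,
# a loose tolerance matches both, so the search becomes less specific.
observed = [1000.01, 1500.5]
theoretical = [1000.0, 1500.0]
tight = count_matches(observed, theoretical, 20, "ppm")
loose = count_matches(observed, theoretical, 1.0, "Da")
```

In a real search, a score derived from such counts would typically be computed for every protein in the target database, which is why a generous error value returns more candidate proteins.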
The Gene Ontology (GO) project is a collaborative effort that addresses the need for consistent descriptions of gene products in different databases. GO has developed three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. Molecular functions generally correspond to activities that can be performed by individual gene products. A biological process is accomplished by one or more ordered assemblies of molecular functions. Examples of broad biological process terms are cell growth and maintenance or signal transduction. A cellular component is a component of a cell. In our scenario, we use three operations: GODBGetNamebyID, GODBGetClassification and GODBGetParentOf. GODBGetNamebyID takes a Gene Ontology accession number and returns the associated term name. For example, given the Gene Ontology accession number GO:0005839, GODBGetNamebyID returns proteasome core complex (sensu Eukaryota). GODBGetClassification takes a Gene Ontology accession number and returns the associated role, e.g., cellular_component. GODBGetParentOf operates on Gene Ontology (GO) vocabulary graphs. It retrieves the list of processes, functions or components situated directly above a given GO identifier in the graph.
Auxiliary services. The integration of Pepmapper and GO requires additional services that deal with mismatches between the Pepmapper output and the GO input. Indeed, the Pepmapper Web service produces a Swissprot accession number, whereas the GO Web service consumes Gene Ontology accession numbers. The following services have been used to cope with these mismatches.
Together, the above services are used to associate a Swissprot entry with its corresponding GO accession numbers. Given a Swissprot accession number, getSwissEntry is first called to get the associated entry. Thereafter, GetInterpro is used to extract the IPR accession numbers. Finally, Interpro2GO retrieves the GO accession numbers corresponding to the IPR ones.
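The chain of three calls can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the actual services: the Python function names, the in-memory dictionaries standing in for the remote sources, and the placeholder accession numbers (P00000, IPR999991, IPR999992, GO:9999999) are all hypothetical. The sketch assumes that InterPro cross-references appear on DR lines of the Swissprot flat-file entry.

```python
import re

# Toy stand-ins for the remote sources; every value here is a placeholder.
SWISSPROT_ENTRIES = {
    "P00000": (
        "ID   EXAMPLE_RAT\n"
        "AC   P00000;\n"
        "DR   InterPro; IPR999991; Example_domain.\n"
        "DR   InterPro; IPR999992; Example_family.\n"
        "DR   Pfam; PF99999; Example; 1.\n"
    )
}
INTERPRO_TO_GO = {
    "IPR999991": ["GO:0005839"],
    "IPR999992": ["GO:0005839", "GO:9999999"],
}

IPR_LINE = re.compile(r"^DR\s+InterPro;\s+(IPR\d{6});", re.MULTILINE)


def get_swiss_entry(accession):
    """Stand-in for getSwissEntry: return the entry text for an accession."""
    return SWISSPROT_ENTRIES[accession]


def get_interpro(entry_text):
    """Stand-in for GetInterpro: extract IPR accession numbers from DR lines."""
    return IPR_LINE.findall(entry_text)


def interpro2go(ipr_ids):
    """Stand-in for Interpro2GO: map IPR numbers to GO numbers, without duplicates."""
    go_ids = []
    for ipr in ipr_ids:
        for go_id in INTERPRO_TO_GO.get(ipr, []):
            if go_id not in go_ids:
                go_ids.append(go_id)
    return go_ids


def swissprot_to_go(accession):
    """Chain the three steps: accession -> entry -> IPR numbers -> GO numbers."""
    return interpro2go(get_interpro(get_swiss_entry(accession)))
```

Calling swissprot_to_go("P00000") on the toy data walks through the same three steps as the services, ending with the GO accession numbers that the GO operations can consume.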
The aforementioned services manipulate different kinds of data. This section specifies the data structures that are involved in the scenario and highlights the relationships among them. They are represented using a UML class diagram (see Figure 1).
Figure 1: Data structure (UML)
Peptide masses represent the masses of the peptides obtained from a Mass Spectrometry experiment. Peptide masses are associated with one or several SwissProt entries. Such an association is established through Pepmapper execution.
SwissProt_Entry. SwissProt is an annotated protein data source maintained by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI). The SwissProt_Entry class represents a SwissProt entry. It is characterized by the following attributes. EntryName is the identifier for the entry; it consists of a string of alphanumeric characters, starting with a letter. PrimaryAccessionNumber represents the principal means of identifying a sequence. Secondary accession numbers are used to allow tracking of data when entries are merged or split. For example, when two entries are merged into one, a "primary" accession number is created, and the accession numbers of the merged entries are set as secondary numbers. ProteinName specifies the proposed official name of the protein. GeneName specifies the name of the gene that codes for the stored protein sequence. The entry also records the preferred scientific name of the organism that was the source of the stored sequence. Sequence represents the sequence data. Note that the SwissProt entry contains other attributes such as the comments, the references, the copyright and so forth. As an example, Appendix A provides the SwissProt PSA2_RAT entry.
CrossReference is used to locate data sources that contain information about SwissProt entries. The CrossReference class is characterized by the identifier of the remote database, e.g., DDBJ, DIP, Ensembl and GenAtlas, and the accession number of the associated entry.
Interpro_entry. Interpro is a documentation source for protein families, domains and functional sites. The Interpro_entry class represents entries within the Interpro source. Every entry has a name and an accession number of the form IPRxxxxxx, where x is a digit. Type specifies whether the entry is a Family, Domain, Repeat or Site. Like SwissProt, Interpro contains many other attributes that we omit for the sake of clarity.
GO_Term. An Interpro entry is associated with one or more GO terms. GO terms have been developed by the Gene Ontology consortium to allow the description of gene products. Every GO_Term is identified by an accession number. It is characterized by a name and a namespace, which is used for classifying the associated gene products. Three categories are possible: molecular function, biological process and cellular component. GO terms are organized in structures called directed acyclic graphs (DAGs). They are related to each other through two relationships, ParentOf and ChildrenOf. ParentOf associates a GO term with the GO terms situated directly above it in the graph. ChildrenOf associates a GO term with the GO terms situated directly below it in the graph.
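The ParentOf and ChildrenOf relationships can be illustrated with a small in-memory graph. This is a sketch only: the class and function names are hypothetical, and apart from GO:0005839 the accession numbers and term names are placeholders that do not reproduce the real GO graph. Because the structure is a DAG rather than a tree, a term may have more than one parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    accession: str
    name: str
    namespace: str  # molecular_function, biological_process or cellular_component


# Placeholder terms; only GO:0005839 is taken from the text above.
TERMS = {
    "GO:0005839": GOTerm("GO:0005839", "proteasome core complex (sensu Eukaryota)",
                         "cellular_component"),
    "GO:9999001": GOTerm("GO:9999001", "example parent A", "cellular_component"),
    "GO:9999002": GOTerm("GO:9999002", "example parent B", "cellular_component"),
    "GO:9999003": GOTerm("GO:9999003", "example root", "cellular_component"),
}

# ParentOf edges: each term maps to the terms directly above it.
PARENT_OF = {
    "GO:0005839": ["GO:9999001", "GO:9999002"],
    "GO:9999001": ["GO:9999003"],
    "GO:9999002": ["GO:9999003"],
}


def parent_of(accession):
    """Terms situated directly above the given term."""
    return PARENT_OF.get(accession, [])


def children_of(accession):
    """Terms situated directly below the given term (inverse of ParentOf)."""
    return [child for child, parents in PARENT_OF.items() if accession in parents]


def ancestors(accession):
    """All terms reachable by following ParentOf edges repeatedly."""
    seen = set()
    stack = list(parent_of(accession))
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parent_of(term))
    return seen
```

Storing only the ParentOf edges and deriving ChildrenOf from them keeps the two relationships consistent by construction.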
Having presented the Web services and data structures involved, this section presents the Taverna workflow that implements the scenario. Figure 2 depicts this workflow. It takes as input the name of the peptide masses, the name of the database used for identifying the protein, the identification error, and the mapping database that associates SwissProt entries with Interpro entries. It returns the name of the identified protein together with the associated GO identifier, GO name and classification. Furthermore, it returns the names of the parents of the associated GO term in the GO ontology.
Figure 2: Protein identification and classification workflow
As shown in Figure 2, the execution of the workflow goes through the following steps. Given the peptide masses, the Pepmapper Web service is first executed to identify the protein by returning its SwissProt accession number. The operation getSwissprotEntry is then executed to get the associated SwissProt entry. This entry is parsed to extract the IPR accession numbers (operation main). Given the IPR accession numbers, the operation filter gets the associated GO identifiers. To do so, it uses the mapping database. Finally, the obtained GO identifier is used to retrieve the name of the GO term by calling the GODBGetNamebyID operation, and its classification by calling the GODBGetClassification operation. In addition, the names of the GO term’s parents are acquired by applying the operations GODBGetParentOf and GODBGetNamebyID in sequence.
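The sequence of steps above can be sketched as a plain function that chains the operations. This is not the Taverna workflow itself: the run_workflow function, the dictionary of callables and the stub services are hypothetical stand-ins, and all accession numbers other than GO:0005839 are placeholders.

```python
def run_workflow(masses, database, error, mapping_db, svc):
    """Chain the workflow operations; 'svc' maps operation names to callables."""
    results = []
    for accession in svc["pepmapper"](masses, database, error):
        entry = svc["getSwissprotEntry"](accession)
        ipr_ids = svc["main"](entry)                 # extract IPR accession numbers
        go_ids = svc["filter"](ipr_ids, mapping_db)  # map IPR numbers to GO identifiers
        for go_id in go_ids:
            parents = svc["GODBGetParentOf"](go_id)
            results.append({
                "protein": accession,
                "go_id": go_id,
                "go_name": svc["GODBGetNamebyID"](go_id),
                "classification": svc["GODBGetClassification"](go_id),
                "parent_names": [svc["GODBGetNamebyID"](p) for p in parents],
            })
    return results


# Stub services with placeholder data, used only to exercise the chain.
NAMES = {
    "GO:0005839": "proteasome core complex (sensu Eukaryota)",
    "GO:9999001": "example parent term",
}
stub_services = {
    "pepmapper": lambda masses, database, error: ["P00000"],
    "getSwissprotEntry": lambda acc: {"accession": acc, "interpro": ["IPR999991"]},
    "main": lambda entry: entry["interpro"],
    "filter": lambda ipr_ids, mapping_db: [
        go for ipr in ipr_ids for go in mapping_db.get(ipr, [])
    ],
    "GODBGetNamebyID": lambda go_id: NAMES[go_id],
    "GODBGetClassification": lambda go_id: "cellular_component",
    "GODBGetParentOf": lambda go_id: ["GO:9999001"] if go_id == "GO:0005839" else [],
}

report = run_workflow([1000.0, 1500.0], "Swissprot", 20,
                      {"IPR999991": ["GO:0005839"]}, stub_services)
```

Passing the operations in as callables mirrors the way the workflow orchestrates independent services: each stub could be replaced by a call to the corresponding Web service without changing the chaining logic.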
We have reported on an experiment in which we implemented a workflow for identifying and classifying proteins. By the end of this experiment, it had become clear to us that current workflow systems, including Taverna, provide a means of integrating proteomics data by orchestrating existing Web services. However, they are by no means sufficient, and further mechanisms must be provided before biologists can effectively and straightforwardly specify and execute scientific workflows. For example, we were required to: (i) implement a Web service that wraps the Pepmapper tool, and, most importantly, (ii) implement two Web services, GetInterpro and Interpro2GO, to deal with the data structure mismatches between Pepmapper and the GO services.
It is unrealistic to envisage implementing new Web services for each scientific workflow in order to deal with data structure mismatches. This suggests the design and implementation of a generic mechanism that deals with such mismatches and can be incorporated within the workflow engine.