Proteins perform a wide variety of functions in living organisms, and identifying and understanding these functions can demystify many processes essential to life. Such knowledge can have profound impacts on human health, agriculture, energy, and the environment by way of new and more effective treatments, more productive crops, and less-polluting biological materials to replace chemical ones.
However, identifying what exactly a given protein does is a difficult task, not only because of the sheer number of proteins but also because the function of each protein is normally regulated by other proteins. Although the sequence in which amino acids - the building blocks of proteins - are arranged is known for 9 million proteins, the function is known for only 400 000 of them. As the techniques for determining the exact sequence of amino acids in a protein - its unique fingerprint - become more powerful, the number of known proteins keeps rising exponentially (Figure 1), a pace that traditional manual methods of determining the function of a protein can never match.
It is to meet this challenge that automatic tools have been created to predict the functions of a protein from what we already know about the functions of similar proteins. However, this approach is complicated by two observations, namely that some proteins are unique - no existing protein is quite similar to them - and that a similar sequence does not always mean a similar function. Lastly, no matter how the function of a protein has been arrived at, researchers still need to verify that function before they can use that information, and the process of verification is complex if no clear and sound evidence is provided.
In identifying the functions of a protein from the proteins similar to it, or even in proposing that a given protein has novel functions, researchers often rely on published biomedical literature, a hard and time-consuming task even for experts, made harder by the sheer volume of published literature. For example, PubMed, a commonly used database of documents in the life sciences, currently comprises over 16 million records. In 2008 alone, 671 904 documents were added to the database. Even if a researcher were to read only the documents added that year, at 10 documents a day, the task would take more than 180 years. Thus, we certainly need more efficient strategies to exploit the literature.
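The reading-time estimate can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the estimate refers to the 671 904 documents added in 2008 and that the researcher reads every day of a 365-day year; the function and constant names are illustrative and do not come from the original text.

```python
# Figures quoted in the text
DOCS_ADDED_2008 = 671_904  # PubMed documents added in 2008
DOCS_PER_DAY = 10          # reading pace stated in the text
DAYS_PER_YEAR = 365        # assumption: reading every day, leap years ignored


def years_to_read(n_docs: int, docs_per_day: int = DOCS_PER_DAY) -> float:
    """Return the years needed to read n_docs at a fixed daily pace."""
    return n_docs / docs_per_day / DAYS_PER_YEAR


print(f"{years_to_read(DOCS_ADDED_2008):.1f} years")  # prints "184.1 years"
```

Under the same assumptions, the full database of over 16 million records would take thousands of years to read, which is why the figure of more than 180 years applies to a single year of additions only.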