October 5th, 2005

Why semantic libraries


Scientific research, and its use by the public, is strongly affected by the way authors, librarians and publishers interact.

The fast evolution of the digital environment has brought this interaction to a crisis. Scientists today write their articles with non-semantic software tools (tools designed around particular types of media rather than around the concept of a semantic document), while librarians and publishers try to recover, imperfectly, as much ad-hoc semantics as they can in order to serve their on-line users (the researchers themselves among them).
This effort to recover semantics is also one of the reasons for the recent price escalation by commercial publishers, a phenomenon which ignited reactions such as the Open Access initiative.

Ignoring the lack of semantic depth in scientific documents produced with traditional tools will only prolong the current crisis. Any approach that does not address it merely masks the primary cause of the conflict: the high cost of dealing with digital documents built on shallow semantics.

Without a semantic authoring language focused on the domains relevant to scientific authors, and to the librarians and publishers involved in their research process, it is virtually impossible to improve substantially the quality of modern research activity or to pull it out of the current crisis in the flow of scientific information.

Current digital technologies allow a better, systemic and long-term approach to the process of building science and its history, within an Open Access paradigm.

I salute the plans for European digital libraries, and I hope to get directly involved in them.

To construct a clear picture, I propose three definitions: semantic library, functional document, semantic authoring tool.

A functional document is a digital, semantically rich, platform-independent document which allows reuse, data mining and interoperation with other digital documents or applications. It does so by providing a list of the digital resources it contains and an interface to that list for external entities to use (parse, manipulate).

A semantic library is a digital library built on functional documents.

A semantic authoring tool is software that supports the building of functional documents.
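
The three definitions above can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the class names, the fields and the methods (Resource, FunctionalDocument, SemanticLibrary, list_resources, by_kind, find) are hypothetical and are not part of any existing specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    """One digital resource held by a document (hypothetical structure)."""
    identifier: str  # name unique within the document
    kind: str        # e.g. "formula", "table", "citation"
    content: str     # platform-independent serialization of the resource


@dataclass
class FunctionalDocument:
    """A document that exposes its resources through a small interface."""
    title: str
    resources: Dict[str, Resource] = field(default_factory=dict)

    def add(self, resource: Resource) -> None:
        self.resources[resource.identifier] = resource

    def list_resources(self) -> List[str]:
        """The list of digital resources the document contains."""
        return sorted(self.resources)

    def get(self, identifier: str) -> Resource:
        """Give an external entity access to one resource."""
        return self.resources[identifier]

    def by_kind(self, kind: str) -> List[Resource]:
        return [r for r in self.resources.values() if r.kind == kind]


@dataclass
class SemanticLibrary:
    """A digital library built on functional documents."""
    documents: List[FunctionalDocument] = field(default_factory=list)

    def find(self, kind: str) -> List[Resource]:
        """Collect resources of one kind across all documents."""
        return [r for d in self.documents for r in d.by_kind(kind)]


# Usage: an external tool mines every formula held by the library.
paper = FunctionalDocument(title="Sample article")
paper.add(Resource("eq1", "formula", "E = m c^2"))
paper.add(Resource("tab1", "table", "a,b\n1,2"))
library = SemanticLibrary(documents=[paper])
formulas = library.find("formula")
```

The point of the sketch is that reuse and data mining go through the document's own interface, so an external tool needs no knowledge of how the document was rendered or stored.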


As an attempt to alleviate the burden created by putting documents with shallow semantics into circulation, I propose:

  • the study, design and implementation of a human-friendly, semantically rich authoring language for scientists, along with grammar-based tools able to transform documents authored in this language into machine-friendly documents (e.g. XML), and to round-trip between these two structures (i.e. the authoring-friendly space and the machine-friendly space) while preserving the document’s semantics in the process;
  • initiating an international collaborative process to construct domain-specific controlled vocabularies, or to enrich semantically the existing ones (e.g. OpenMath, MathML, MusicML, ChemML), by proposing appropriately focused E.U. research projects and by building collaborative consortia of interested parties (university/research libraries, publishers of science, researchers involved in language structure, data mining, domain-specific vocabularies, semantic annotations, digital ontologies, education etc.);
  • helping the scientific and education communities, through example, to become aware of the benefits of authoring documents with a flexible and layered semantic architecture, for their own use and for that of their readers.
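
The round trip named in the first proposal can be illustrated with a deliberately tiny sketch. The authoring syntax used here (one "role: text" statement per line) is an assumption made only for this example; it is not the proposed authoring language, and the function names to_xml and to_authoring are hypothetical. Only Python's standard xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET


def to_xml(source: str) -> str:
    """Authoring-friendly text -> machine-friendly XML.

    Assumed toy syntax: each non-empty line is "role: text", and each
    role must be a valid XML element name.
    """
    root = ET.Element("document")
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no semantics in this toy syntax
        role, _, text = line.partition(":")
        ET.SubElement(root, role.strip()).text = text.strip()
    return ET.tostring(root, encoding="unicode")


def to_authoring(xml_text: str) -> str:
    """Machine-friendly XML -> authoring-friendly text."""
    root = ET.fromstring(xml_text)
    return "\n".join(f"{child.tag}: {child.text or ''}" for child in root)


# Usage: the semantic roles survive the trip in both directions.
source = "title: Why semantic libraries\nclaim: cost(shallow) > cost(rich) & rising"
xml_form = to_xml(source)
restored = to_authoring(xml_form)
```

In this sketch the role names play the part of a controlled vocabulary. A real authoring language would need a grammar with nesting and with natural language interleaved between the structured parts, which this example does not attempt.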

No short-term project can really cover all these directions. My experience in the field suggests that there are difficult, subject-specific legacy issues to solve as a prerequisite to benefiting fully from these solutions, while building and extending domain-specific controlled vocabularies is, in principle, a never-ending set of necessarily parallel tasks.

The optimal framework for a concerted approach to these problems is the long-term study, design and creation of a set of Open Source software tools and specifications for authoring scientific documents and for building, with them, semantic libraries for public use.

This direction of research will address the needs of authors, librarians and publishers in a democratic way, by continuously incorporating their feedback through the facilities now typical of the Internet (e.g. collaborative content management tools and communication standards), so that imbalances like the crisis mentioned above can no longer appear or last.

The author of this proposal is prepared to get involved in the development of semantic libraries at any level of detail.


The beneficial consequences of such an effort for modern scientific research are many and deep:

  • the semantics used at the authoring stage guides the archiving agents or library engines, which enables a high-quality library service and highly efficient reuse of research results;
  • librarians and/or professional groups will have a well-defined framework for developing and refining controlled vocabularies and for building richer semantic structures based on them;
  • researchers/authors can reuse these vocabularies to structure their documents better, and to make better use of the documents themselves by feeding their structured sections to automata where appropriate;
  • from a library which stores semantic documents, a researcher or student can effectively assemble up-to-date monographs on the fly, based on a class of subjects of interest;
  • the history of science and long-term preservation can be effectively supported, because the archiving process is oriented towards semantics which the creators of the document made available at the authoring stage, and which can preserve usability in the face of another drastic change of media;
  • a more effective scientific exchange is enabled because the semantic structures can be rendered in the notional space of arbitrary readers;
  • the publishing industry is freed to focus on providing renderings better tuned to specific users, machines or media, thanks to the rich semantics available in the original documents.

The big picture

  • let scientists easily interleave, while authoring, their own natural language with controlled vocabularies they helped create,
  • so that librarians can use the layered semantics for long-term preservation and for answering research queries with higher relevance than today,
  • to enable publishers to improve the quality, and maintain a low price, of their offerings by making their reader-related processing orthogonal to the authoring process,
  • to encourage and build a sustainable concept of scientific self-archiving while simplifying the peer review processes.

For the legacy documents, available in physical form and waiting to be digitized, I have some comments related to their copyright.

Those interested further in this subject may want to read this simple essay on the meaning of scientific documents.

