I enumerate here a few pressing issues related to semantic authoring and preservation in the digital creation, administration and use of scientific documents, and I sketch some present and future solutions to them.
Status of authoring
Several standards related to the structuring of documents have emerged: DocBook for structuring generic documents (books, articles), MathML and/or OpenMath for semantically clean authoring of mathematical expressions, MARCXML for representing and communicating bibliographic records and related metadata, Unicode for the unambiguous specification of international or domain-specific symbols, and so on. Even so, the authors of scientific documents still write in TeX or in an alternative, proprietary software solution.
That is, when authoring his articles or books, the scientist is in the same situation as 10-20 years ago: he has no clue as to how these standards can help him, and no effective tools, open source or otherwise, that let him use them effortlessly. Why?
Stating the issues
Part of the answer is that neither the publishers nor the librarians have helped authors become aware of, or concerned with, the fate of their own written works. This awareness was not an urgent matter in the paper publishing era (the article would last as long as the paper and sit on a shelf), but in the digital document era it becomes a real issue. It is easy and cheap to create multiple versions or multiple copies of a digital document, so how can the author make sure that these versions are not corrupted in the process, that their rendering is not broken at a later time (when the reader accesses them), and that they are stored in a place where an indexing machine can find them and list them in response to the appropriate query?
The answer to this question has a much higher priority than, say, digital access rights, unless one chooses to protect a corrupted representation of one's work.
The answer is bound to rely on the open standards noted above.
By contrast, proprietary formats and proprietary document authoring solutions do not guarantee an appropriate rendering (or meaning) in the future, be it near or far, unless they commit to a standard semantic vocabulary (or a set of them) which the author should use while editing his document.
Defining vocabularies with a meaning (that is, with a formally defined way to use them) is an exciting research topic today (the steps and standards needed to create ontologies in the digital era are detailed by others). One cannot reasonably expect, however, an author to jump suddenly from writing plain text or mathematical expressions to using ontology-defined concepts: the authoring process would become tedious and would resemble computer programming. In practice, the author is still helpless in ensuring that his work will be reachable and usable after a period of time.
The ontologies are more helpful in extracting and managing the knowledge created by the authors and machines. We are, though, concerned here mainly with the knowledge creation process.
The need for an effective authoring solution, positioned between being directly useful to machines and being simple for humans to type, is becoming obvious. The solution sketched in the following sections is biased towards protecting the time of the human authors.
What do I mean? To whom?
These are common questions in the author's mind: the meaning of his work is its capability of being used for a purpose (whether intended or not).
A handwritten article will have a meaning to an appropriately educated human; a computer typed text will have a meaning to some rendering, printing or indexing software (this is the lowest level semantic layer in a digital document) and a different meaning to the final human reader (presumably the highest level of semantics); again, a scan of an old article will have a meaning for the graphical rendering software, another meaning for the character recognition software and a different meaning for the final human reader.
We note, even if it sounds trivial to some, that an article is, in all cases, meant to be found, read and used by a human being: it is, in short, a message.
The machines can help in the process: index an article, act in a certain way when a specific expression is found (flag a misspelling, validate an expression or start an external process), advertise the presence of the article to the interested audience, check its consistency according to the available semantic rules, render it on different media, append a reader's comments to a section of it, store it in the appropriate digital library slot and relate its presence to the neighbouring articles, keep a version history of it, and assemble it with other documents according to an editor's, or library user's, request.
These functionalities depend on the availability of the semantic layers in the digital document. A collection of such documents, with the services they enable, would form a semantic library.
Some of these layers can be hinted at by the author: the computer cannot even infer where a paragraph starts unless the author types some specific keys, nor can it relate concepts accurately without the author's hints to a vocabulary of concepts (the consequence is the inability to return effectively useful search results).
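To make the two kinds of hint concrete, here is a minimal sketch in Python. It assumes a blank line marks a paragraph boundary and uses a hypothetical inline marker, `[[key:text]]`, to point a phrase at a vocabulary entry. The marker syntax, the `para` and `term` element names and the `urn:example:` identifier are illustrative assumptions, not part of any standard named above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical inline hint: [[concept-key:visible text]]
HINT = re.compile(r"\[\[([\w-]+):([^\]]+)\]\]")


def to_paragraphs(text, vocabulary):
    """Turn lightly hinted plain text into a list of <para> strings."""
    paragraphs = []
    # Hint 1: a blank line is the author's signal that a paragraph ends.
    for block in re.split(r"\n\s*\n", text.strip()):
        # Collapse whitespace and escape XML special characters.
        body = escape(" ".join(block.split()))

        def mark(match):
            key, surface = match.group(1), match.group(2)
            # Unknown keys raise KeyError, so a typo in a hint is caught early.
            return f"<term ref={quoteattr(vocabulary[key])}>{surface}</term>"

        # Hint 2: an explicit pointer from a phrase to a vocabulary entry.
        paragraphs.append(f"<para>{HINT.sub(mark, body)}</para>")
    return paragraphs


if __name__ == "__main__":
    vocabulary = {"energy": "urn:example:energy"}  # placeholder identifier
    text = "The total [[energy:energy]] is conserved.\n\nA second paragraph."
    for para in to_paragraphs(text, vocabulary):
        print(para)
```

The author types two keystroke-level hints and nothing else. Everything a machine later does with paragraphs or concepts rests on them.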
Defining a semantic solution
The set of hints should stay as small as possible while maximizing the functional space of which the document can be made part. The fuzzy constraint on this problem is the author's patience: he always has the alternative of creating a semantically flat document, at the cost of his editors' time and of his audience's time and size (a cost which is almost invisible at the time of authoring).
One can name the above requirement the user-friendliness of the authoring package.
The author of a scientific article also wants to communicate something and to preserve that message for future readers.
This requirement means that the authored document must have a well-defined structure. Well-defined, in turn, means that, at the end of the authoring process, the document should:
- be created in an open format which is platform neutral (XML),
- contain enough information to locate it (administrative metadata: author, date etc.),
- contain enough semantic hints for a librarian to store, preserve and manage it (document structure definition for a validating procedure),
- contain enough presentational semantic hints for a publisher to render it or relate it to other documents (TeX-like suggestions about how a symbol should look),
- contain enough hints for the reader to locate and use it (consistent use of semantic vocabularies defined in open standards, e.g. MathML-content, and keywords as often as, and wherever, necessary).
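As an illustration of the conditions above, the following Python sketch checks a tiny XML document for administrative metadata, a minimal structure and content MathML. Only the MathML namespace and its `apply`, `times`, `power`, `ci` and `cn` elements come from a standard. The other element names merely resemble DocBook and are assumptions, as are the placeholder author and date. The checks are a simplified stand-in for real schema validation.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Illustrative document; author, date and keyword are placeholders.
DOCUMENT = """\
<article>
  <info>
    <author>A. Author</author>
    <date>2000-01-01</date>
    <keyword>rest energy</keyword>
  </info>
  <section>
    <title>Result</title>
    <para>The rest energy is
      <math xmlns="http://www.w3.org/1998/Math/MathML">
        <apply><times/>
          <ci>m</ci>
          <apply><power/><ci>c</ci><cn>2</cn></apply>
        </apply>
      </math>
    </para>
  </section>
</article>
"""


def check_document(xml_text):
    """Return a list of problems; an empty list means every check passed."""
    # ET.fromstring raises ParseError if the text is not well-formed XML.
    root = ET.fromstring(xml_text)
    problems = []
    # Administrative metadata needed to locate the document.
    for field in ("author", "date"):
        if root.find(f"info/{field}") is None:
            problems.append(f"missing metadata: {field}")
    # A minimal structural hint for the librarian.
    if root.find("section/title") is None:
        problems.append("missing structure: section title")
    # At least one expression authored in content MathML.
    if root.find(f".//{{{MATHML_NS}}}apply") is None:
        problems.append("no content MathML found")
    return problems


if __name__ == "__main__":
    print(check_document(DOCUMENT) or "all checks passed")
```

A production workflow would validate against a published schema rather than hand-written checks. The sketch only shows that each condition in the list is something a machine can test.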
The authoring tool
The requirements above can be satisfied by an authoring tool that allows the author to type in a continuous flow, instead of switching between the computer's input devices.
The author should be able to type natural language and expressions belonging to controlled vocabularies. The authoring tool should provide a straightforward way of creating new definitions by combining older ones, or just renaming them into a shorter/friendlier form, adapted to that specific author's needs and habits, without degrading the semantics in the background. The author should also be able to apply semantic emphasis to portions of the message. These emphases become hints to the search agents and shrink the dimensionality of the search space. This technique enables a much faster location of, say, particular concepts or subtle differences the authors want to point out. Use of this latter feature would make the circulation and reviewing of concepts more fluid than it is today.
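The idea of combining and renaming definitions without losing the background semantics can be sketched as follows. This is a hypothetical illustration in Python, not a description of an existing tool: the `Shorthands` class and the names `square`, `kinetic` and `T` are invented for the example. Only the content MathML elements (`apply`, `times`, `power`, `ci`, `cn`) come from the standard.

```python
def apply(op, *args):
    """Build a content MathML <apply> element as a string."""
    return f"<apply><{op}/>{''.join(args)}</apply>"


def ci(name):
    """A content MathML identifier."""
    return f"<ci>{name}</ci>"


def cn(value):
    """A content MathML number."""
    return f"<cn>{value}</cn>"


class Shorthands:
    """Author-specific names that always expand to the shared vocabulary."""

    def __init__(self):
        self._macros = {}

    def define(self, name, builder):
        # A new definition is a function that assembles older ones.
        self._macros[name] = builder

    def rename(self, new_name, old_name):
        # A friendlier name points at the same definition, so the
        # expansion, and therefore the semantics, is unchanged.
        self._macros[new_name] = self._macros[old_name]

    def expand(self, name, *args):
        return self._macros[name](*args)


shorthands = Shorthands()
shorthands.define("square", lambda x: apply("power", x, cn(2)))
shorthands.define(
    "kinetic",
    lambda m, v: apply("times", cn("0.5"), m, shorthands.expand("square", v)),
)
shorthands.rename("T", "kinetic")

if __name__ == "__main__":
    print(shorthands.expand("T", ci("m"), ci("v")))
```

Whatever name the author prefers, the stored document carries the same standard markup, so search agents and renderers are unaffected by personal habits.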