Tag internet - humanist @ roua.org :

version 0.9.4

Written by Romeo Anghelache

Good news: Hermes development is partially supported until June 2006 by MPI for Gravitational Physics.

A more radical version of Hermes is in the works; 0.9.4 was the last one developed within the 'deadline' mindset.

My plan is to have a Hermes which generates a clean, validatable document model, complete with the metadata necessary to store the document in institutional repositories (digital libraries).


blacklisting spam

Written by Romeo Anghelache

There are several methods to block spam. Yet there are some stupid individuals (SpamHaus and the likes of them) who blacklist servers that have sent spam. These stupid individuals cannot imagine that a spamming server can share its IP address with an innocent domain, like roua.org here. As a result, no matter where I buy my webhosting, I get spam-blacklisted sooner or later, because the real spammers move around, I guess. I'm left with sending removal requests to these idiots. This is an aggression.

You have no right to mess with the mail from my domain; you have no right to list or blacklist anything related to my domain, my IP or anything else. I am buying a (webhosting) service, and you are blocking it because you didn't learn enough in school, losers. You should wake up one morning and find your door sealed with concrete because you happen to share the floor with a fellow who has done something wrong. You should be outlawed, blacklisters. And the administrators using your blacklists are incompetent. They should lose their jobs.

Are any of you readers free of spam because your admin uses blacklisting? No? Then tell the dinosaur there not to use these blacklists created by self-important losers. They have no effect other than bothering people who have nothing to do with spam. Tell the dinosaur about Bayesian methods and tell him to forget about using spam block lists.

Here's an analysis of list-based spam blocking (Jacob, 2003) and the bad consequences of using it.

Basically, a content-based adaptive method against spam is much better than any method based on features totally external to the spam itself, like IP addresses. The best policy is to use Mozilla Mail, or any other mail client which already implements a Bayesian filter.
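To make "content-based and adaptive" concrete, here is a minimal sketch of a naive Bayes filter. It is not the filter shipped in any particular mail client; the class name `BayesFilter`, the tokenizer and the four training messages are invented for illustration only. The point is that the decision depends solely on the words of the message, and that the filter adapts whenever it is trained on a new example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class BayesFilter:
    """Tiny naive Bayes classifier with add-one (Laplace) smoothing."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Adapt the filter: count the words of one labelled message."""
        self.words[label].update(tokenize(text))
        self.messages[label] += 1

    def _log_score(self, tokens, label):
        vocabulary = set(self.words["spam"]) | set(self.words["ham"])
        total = sum(self.words[label].values())
        # Log prior: share of the training messages carrying this label.
        score = math.log(self.messages[label] / sum(self.messages.values()))
        for token in tokens:
            count = self.words[label][token]
            # Smoothed log likelihood of the token under this label.
            score += math.log((count + 1) / (total + len(vocabulary)))
        return score

    def is_spam(self, text):
        """Compare the two log scores; only the message content is used."""
        tokens = tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")


# Invented training messages, for illustration only.
spam_filter = BayesFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("win money now click here", "spam")
spam_filter.train("meeting notes for the Hermes release", "ham")
spam_filter.train("the MathML conversion works fine", "ham")
```

With this toy training set, `spam_filter.is_spam("buy cheap pills")` comes out true and `spam_filter.is_spam("the Hermes release notes")` comes out false. The sender's IP address never enters the computation.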

I don't get spam at all on my roua.org account. Yes, really.

There's a script I use (in JavaScript) not only to hide my e-mail address from spammers but also to feed them a false one; you can check how it works on my webpage. Note the initial string, which fakes an e-mail address (that is what a spammer would get by parsing the page for addresses), and the correct one, which forms only when you click and the script executes. Hiding an address with JavaScript is not an original idea, but feeding the spammer a fake e-mail address available in the page is.
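The script on the webpage is not reproduced here, so what follows is only a sketch of the same idea under my own assumptions: a Python helper (the hypothetical `build_mail_link`) that emits an HTML link whose visible `mailto:` target is a decoy, while the real address is assembled by a JavaScript `onclick` handler from separate fragments. The regular expression `ADDRESS_PATTERN` stands in for a naive address harvester, and both addresses are placeholders.

```python
import re

# A naive harvester: anything that looks like user@host.tld in the page source.
ADDRESS_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def build_mail_link(real_address, decoy_address):
    """Return an HTML link showing a decoy and assembling the real address on click."""
    user, domain = real_address.split("@")
    # The real address exists only as separate string fragments that the
    # JavaScript handler joins at click time; the "@" is produced by
    # String.fromCharCode(64), so it never appears next to the fragments.
    script = (
        "location.href='mailto:' + '{0}' + String.fromCharCode(64) + '{1}'; "
        "return false;"
    ).format(user, domain)
    return '<a href="mailto:{0}" onclick="{1}">e-mail me</a>'.format(
        decoy_address, script
    )


# Placeholder addresses, for illustration only.
html = build_mail_link("author@example.org", "nobody@example.invalid")
harvested = ADDRESS_PATTERN.findall(html)
```

Here `harvested` contains only the decoy address, because the real one never occurs in the page as a single matchable string. A harvester that actually executes scripts would not be fooled by this.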

open-access, there you go

Written by Romeo Anghelache

I'm just happy that other people also have deep concerns about the open-access issue: Fellows of The Royal Society reacted promptly, with an open letter. (The reader can put things in context by also perusing the links therein, one of which is to the RCUK position statement on research outputs, June 2005.)

TRS's president answered it by providing a list of "some" of the TRS's issues, which, by chance, makes it easier for me to address each of them, and easier for TRS to bring another set to the table once these are cleared.

I'll get back to this subject as soon as I take a break from programming a tool relevant to open-access ;).

semantic libraries

Written by Romeo Anghelache

Why semantic libraries


Scientific research and its use by the public are strongly affected by the way authors, librarians and publishers interact.

The fast evolution of the digital environment has brought this interaction into crisis: scientists today use non-semantic software tools for authoring their articles (tools designed around certain types of media, rather than around the concept of a semantic document), while librarians and publishers try, sloppily, to recover as much ad-hoc semantics as they can in order to serve their online users (among whom are the researchers themselves). This effort to recover semantics is also one of the reasons for the recent price escalation by commercial publishers, a phenomenon which ignited reactions such as the Open Access initiative.

Ignoring the lack of semantic depth in scientific documents produced with traditional tools will only lengthen the current crisis; any other way of resolving this conflict only masks its primary cause: the high cost of dealing with digital documents built on shallow semantics.

Without a semantic authoring language focused on the domains relevant to scientific authors, and to the librarians and publishers involved in their research process, it is virtually impossible to substantially improve the quality of modern research activity or to pull it out of the current crisis in the flow of scientific information.

Current digital technologies allow a better, systemic and long-term approach to the process of building science and its history, in an Open Access paradigm.

I salute the plans for European digital libraries, and I hope to get directly involved in them.

To construct a clear picture, I propose three definitions: semantic library, functional document, semantic authoring tool.

A functional document is a digital, semantically rich, platform-independent document which allows reuse, data mining and interoperation with other digital documents or applications by providing a list of the digital resources it contains and an interface through which external entities can use (parse, manipulate) them.

A semantic library is a digital library built on functional documents.

A semantic authoring tool is software that provides support for building functional documents.
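As an illustration of the "list of resources plus an interface" part of the definition, here is a minimal sketch in Python. The XML layout (`document`, `resources`, `resource`, `body`) and the class `FunctionalDocument` are hypothetical, made up for this example; they are not the Hermes document model or any existing standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a manifest of resources followed by the content itself.
SAMPLE = """\
<document>
  <resources>
    <resource id="title" type="metadata"/>
    <resource id="eq1" type="formula"/>
    <resource id="ref1" type="citation"/>
  </resources>
  <body>
    <title id="title">On functional documents</title>
    <formula id="eq1">E = m c^2</formula>
    <citation id="ref1">A. Author, A Placeholder Title</citation>
  </body>
</document>
"""


class FunctionalDocument:
    """Small interface an external tool could use to inspect a document."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def resources(self):
        """Return the declared resources as (id, type) pairs, in document order."""
        return [(r.get("id"), r.get("type")) for r in self.root.iter("resource")]

    def get(self, resource_id):
        """Return the text of the body element carrying the given id."""
        for element in self.root.find("body").iter():
            if element.get("id") == resource_id:
                return element.text
        raise KeyError(resource_id)


doc = FunctionalDocument(SAMPLE)
formulas = [rid for rid, kind in doc.resources() if kind == "formula"]
```

An external application does not need to know how the document is rendered: it asks for the list of resources, picks the kinds it understands (here, `formulas`) and retrieves them with `doc.get(...)`.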


As an attempt to alleviate the burden of putting shallow-semantics documents into circulation, I propose:

  • the study, design and implementation of a human-friendly, semantically rich authoring language for scientists, along with grammar-based tools able to transform documents authored in this language into machine-friendly documents (e.g. XML), and to round-trip between these two structures (i.e. the authoring-friendly space and the machine-friendly space) while preserving the document's semantics in the process;
  • initiating an international collaborative process to construct domain-specific controlled vocabularies, or to semantically enrich the existing ones (e.g. OpenMath, MathML, MusicML, ChemML), by proposing appropriately focused E.U. research projects and by building collaborative consortia of interested parties (university/research libraries, publishers of science, researchers involved in language structure, data mining, domain-specific vocabularies, semantic annotations, digital ontologies, education etc.);
  • helping, through example, the scientific and education communities to become aware of the benefits of authoring documents with a flexible and layered semantic architecture, for their own and their readers' use.
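The round trip named in the first point can be sketched on a toy scale. The authoring notation `[[vocabulary:term|text]]`, the XML element `term` with its `ref` attribute, and the vocabulary name `math:integral` are all assumptions made for this example; a real authoring language would need a proper grammar rather than two regular expressions.

```python
import re
from xml.sax.saxutils import escape, unescape

# Hypothetical authoring notation: [[vocabulary:term|visible text]]
AUTHORED_TERM = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")
# Its machine-friendly counterpart: <term ref="vocabulary:term">visible text</term>
XML_TERM = re.compile(r'<term ref="([^"]+)">([^<]*)</term>')


def to_xml(authored):
    """Turn one authored paragraph into an XML paragraph."""
    parts, last = [], 0
    for match in AUTHORED_TERM.finditer(authored):
        parts.append(escape(authored[last:match.start()]))
        parts.append('<term ref="{0}">{1}</term>'.format(
            escape(match.group(1), {'"': "&quot;"}), escape(match.group(2))))
        last = match.end()
    parts.append(escape(authored[last:]))
    return "<p>" + "".join(parts) + "</p>"


def to_authored(xml_text):
    """Inverse of to_xml: recover the authored paragraph from the XML."""
    inner = xml_text[len("<p>"):-len("</p>")]
    parts, last = [], 0
    for match in XML_TERM.finditer(inner):
        parts.append(unescape(inner[last:match.start()]))
        parts.append("[[{0}|{1}]]".format(
            unescape(match.group(1), {"&quot;": '"'}), unescape(match.group(2))))
        last = match.end()
    parts.append(unescape(inner[last:]))
    return "".join(parts)


authored = "The [[math:integral|integral]] of f is < 1 & finite."
as_xml = to_xml(authored)
round_tripped = to_authored(as_xml)
```

The semantic annotation (which vocabulary entry the word "integral" refers to) survives in both directions, and `round_tripped` equals `authored`. That equality is the property the proposal asks of the real tools.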

No short-term project can really cover all these directions: my experience in the field suggests that difficult, subject-specific legacy issues must be solved before these solutions can be used to full advantage, while the building and extending of domain-specific controlled vocabularies is, in principle, a never-ending set of necessarily parallel tasks.

The optimal framework for a concerted approach to these problems is the long-term study, design and creation of a set of Open Source software tools and specifications for authoring scientific documents and for building, with them, semantic libraries for public use.

This direction of research will address the needs of authors, librarians and publishers in a democratic way, by continuously incorporating their feedback through the facilities now typical of the Internet (e.g. collaborative content management tools and communication standards), so that imbalances like the crisis described above can no longer appear or last.

The author of this proposal is prepared to get involved in the development of semantic libraries at any level of detail.


The beneficial consequences of such an effort on the modern scientific research processes are multiple and deep:

  • the semantics used at the authoring stage gives hints to the archiving agents and library engines, which enables a high-quality library service and highly efficient reuse of research results;
  • the librarians and/or the professional groups will have a well-defined framework for developing and refining controlled vocabularies and for building richer semantic structures based on them;
  • the researchers/authors can reuse these vocabularies to structure their documents better, and to make better use of the documents themselves by feeding their structured sections to automata where appropriate;
  • from a library which stores semantic documents, a researcher or student can effectively assemble up-to-date monographs on the fly, based on a class of subjects of interest;
  • the history of science and the issues of long-term preservation can be effectively supported, because the archiving process is oriented towards semantics that the creators of the document made available at the authoring stage, and so can preserve usability when facing a new drastic change of media;
  • a more effective scientific exchange is enabled because the semantic structures can be rendered in the notional space of arbitrary readers;
  • the publishing industry is freed to focus on providing renderings better tuned to specific users, machines or media, due to the availability of rich semantics in the original documents.

The big picture

  • let scientists easily interleave their own natural language with controlled vocabularies they helped create while authoring,
  • so that librarians can use the layered semantics for long-term preservation and for answering research queries with higher relevance than today,
  • to enable publishers to improve the quality of their offerings while keeping their price low, by making their reader-related processing orthogonal to the authoring process,
  • to encourage and build a sustainable concept of scientific self-archiving while simplifying the peer review processes.

For the legacy documents, available in physical form and waiting to be digitized, I have some comments related to their copyright.

Those interested further in this subject may want to read this simple essay on the meaning of scientific documents.

version 0.9.3

Written by Romeo Anghelache

Hermes version 0.9.3 is online:

  • the library document is made of sections (type envelope, section, subsection, subsubsection, bibliography etc.),
  • it has better structured metadata,
  • the citations/bibliography have gone semantic too (no more on-the-fly ID generation),
  • some bugs are fixed (DeclareMathOperator of AMS-LaTeX works fine now),
  • the publishing stylesheet is more elegant and exports the metadata in the typical XHTML fields.

The example collection of converted articles (XML+MathML+Unicode) is available here.

