
Diachronic publishing and the Octopus document

Written by Romeo Anghelache

This piece has been reachable on my website at a different URL since 1999, but I have just decided to move it here.

Abstract (of a talk at IUK99), 1999, March 24

A possible way of maintaining the validity and accessibility of references in an electronically published scientific document is presented. A definition is attempted of an Active Brokers Network, a technical means of maintaining valid links from the document containing the reference to the referenced documents. Such tools would enable diachronic publishing.

Detail

Science is built step by step, starting from an idea, axiom, reasoning or empirical result, so each further step must reference the earlier one(s). That is, a reader of a scientific paper (A) must have access to the context, foundation, premises and details of what is read. Therefore (A) contains references which must remain valid over time if (A) and the citations used in it are to remain a comprehensive document.

Many of us still spend a lot of time searching for and piling up cited papers in order to cover most of the subject of the paper being read. An Internet-based solution for citing/referencing naturally has a dynamic character, as opposed to the printed paper, which contains static pointers to other sources of information.

We read some electronically stored documents (Bx) and want to cite them when writing a document (A). We can do that in HTML, but what if one or all of those (Bx) documents are moved to another public place? A cited scientific paper (B) maintains its validity as a referenced document if it remains accessible and its content doesn't change over time. How can we keep the reference valid?

Some partially active solutions for keeping it accessible might be:

* (HTML) (B) leaves traces, and the reader of (A) tracks them from the initial location and updates the reference to (B) to its final location; this is obviously inefficient because traces can be erased or can become very long.
* (XML) (A) contains multiple references to the already mirrored (B); we encounter the same inefficiency because of the static character of the references.

Entirely active solutions might be more promising, so where should we implement the “active” mechanism: in (A), in (B) or at a point between them? (A) would generate traffic by checking the location of (B) periodically, and (B) doesn’t care or know about (A); therefore we need activation at a middle point: an Active Brokers Network (ABN) to which (B) beeps (sends a notification) when it moves. (A) declares its own properties to an AB point (ABP) and cites the (potentially different) ABP to which (B) has already declared its properties.

The properties (A) should declare to the ABP might be the classical ones: title, author, date of creation, keywords, its (ABN assigned) Unique Identifier (UID), and the ABN UIDs of the cited documents.

Therefore the ABN is a distributed database of (the above) properties which should relate UIDs to the (older) referenced UIDs. A query on it should reveal paths through series of scientific articles having a property in common. Obviously, references point backwards in time; this feature may be used as part of the validation criteria for an electronic reference and/or as a simplifying constraint on searching.
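
The declare / move / resolve cycle described above can be sketched in a few lines of Python. This is only a toy, single-process stand-in for what would be a distributed database; the names `Record`, `BrokerNetwork`, `declare`, `notify_move` and `resolve`, as well as the example URLs, are assumptions made for illustration and are not part of the original proposal.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Properties a document declares to its broker point (ABP)."""
    uid: str               # ABN-assigned unique identifier
    title: str
    author: str
    created: date
    keywords: frozenset
    location: str          # current URL of the document
    cites: tuple = ()      # UIDs of the referenced (older) documents


class BrokerNetwork:
    """Hypothetical in-memory stand-in for the distributed ABN database."""

    def __init__(self):
        self._records = {}

    def declare(self, record):
        # A reference must point backwards in time to an already known document.
        for cited_uid in record.cites:
            cited = self._records.get(cited_uid)
            if cited is None:
                raise ValueError(f"unknown cited UID: {cited_uid}")
            if cited.created > record.created:
                raise ValueError(f"{cited_uid} is newer than {record.uid}")
        self._records[record.uid] = record

    def notify_move(self, uid, new_location):
        # The "beep" sent on behalf of (B) when it changes location.
        self._records[uid].location = new_location

    def resolve(self, uid):
        # Used by a reader of (A) to find where (B) lives now.
        return self._records[uid].location


abn = BrokerNetwork()
abn.declare(Record("uid-B", "Earlier result", "B. Author", date(1997, 5, 1),
                   frozenset({"percolation"}), "http://old.example.org/b.html"))
abn.declare(Record("uid-A", "Later step", "A. Author", date(1999, 3, 24),
                   frozenset({"percolation"}), "http://example.org/a.html",
                   cites=("uid-B",)))
abn.notify_move("uid-B", "http://new.example.org/b.html")
print(abn.resolve("uid-B"))
```

(A) stores only the UID of (B), so the reference survives the move; the backwards-in-time check in `declare` is the validation criterion mentioned above.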

In fact, a new (A) document should then look like a folder containing, besides its internal objects (texts, equations, data tables, scripts, graphics, ...), other folders representing the referenced (B) documents, and so on recursively. That is, every (A) becomes a virtual and distributed file system based on the ABN.
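
A minimal Python sketch of this folder view, assuming the citation data is already available as a plain mapping from each UID to the UIDs it references (the mapping `cites` and the function names `as_folder` and `render` are hypothetical):

```python
def as_folder(uid, cites):
    """Return the nested 'folder' of a document: {cited_uid: its folder, ...}.

    `cites` maps each UID to the list of UIDs it references. Because
    references point backwards in time, the graph has no cycles and the
    recursion terminates.
    """
    return {cited: as_folder(cited, cites) for cited in cites.get(uid, [])}


def render(uid, cites, depth=0):
    """Format the folder as an indented tree, one UID per line."""
    lines = ["  " * depth + uid]
    for cited in cites.get(uid, []):
        lines.extend(render(cited, cites, depth + 1))
    return lines


cites = {"A": ["B1", "B2"], "B1": ["C"], "B2": ["C"]}
print("\n".join(render("A", cites)))
```

A document cited along two routes (here `C`) shows up in both subfolders, much as a file reachable through two directories would.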

The addition of a document to an ABN may be done through an enhancement to current operating systems, e.g. adding an active “public/local/private” property beside the passive “rwx” ones. When a document gets the “public” attribute, the operating system managing it should beep its creation/movement to the nearest ABP.

The ABPs could implement various mechanisms, such as obsoleting their documents (but better not: if we want a history of science, once a scientific document is made “public” it must be frozen and cited as it is), or extinction: a document which is not cited at all for a long duration (say, ten years) should disappear (this way, scientific garbage is thrown out simply by Time).
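
As a rough illustration of the extinction rule, the Python sketch below marks a document for removal when neither its creation nor its most recent citation falls within the last ten years. That reading of "not cited for a long duration", the function name `extinct_uids` and the two input mappings are assumptions, not something the proposal fixes.

```python
from datetime import date, timedelta


def extinct_uids(created, cited_on, today, years=10):
    """Return the UIDs whose last citation (or creation) is older than `years`.

    created:  {uid: creation date}
    cited_on: {uid: [dates on which a newer document cited it]}
    """
    # Approximate the period in days to avoid leap-day corner cases.
    cutoff = today - timedelta(days=int(365.25 * years))
    doomed = []
    for uid, born in created.items():
        last_seen = max(cited_on.get(uid, []) + [born])
        if last_seen < cutoff:
            doomed.append(uid)
    return sorted(doomed)


created = {"old": date(1980, 1, 1), "kept": date(1980, 1, 1), "new": date(1998, 1, 1)}
cited_on = {"kept": [date(1995, 6, 1)]}
print(extinct_uids(created, cited_on, today=date(1999, 3, 24)))
```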

Possible results of implementing the Active Brokers Network:

* General
    o boneless, flexible documents (Octopus) which can incorporate knowledge related to a subject over time, i.e. diachronic publishing. The Octopus document becomes a continuously growing monograph written by several authors and thus it tends to form itself as an exhaustive/comprehensive unit of scientific knowledge.
    o a keyword search in such a distributed database would reveal paths (unidirectional in time) with referenced/referring papers, creating ad-hoc dynamic documents structured and focused on a specific scientific subject. These things bring to mind percolation and all the mathematics/physics associated with random walks, allowing a deeper formal understanding of information retrieval and structured knowledge.
    o a search by keyword AND author may reveal true schools of thought in science.
* Publishing
    o the refereeing process of (B) can be done by the authors of the (Ax) which cite (B), being interested_in / aware_of its content, and not by hidden, possibly indifferent, readers.
    o the editor's selection process of a review's issue can be done by simple path selections through these Octopus documents.
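
The keyword search mentioned in the list can be sketched as a walk over the citation graph that only follows documents sharing the keyword. The Python below is a small illustration under that assumption; `keyword_paths` and its two input mappings are hypothetical names, and a real ABN would run such a query over a distributed database rather than over in-memory dictionaries.

```python
def keyword_paths(keyword, keywords, cites):
    """List maximal citation chains in which every document has `keyword`.

    keywords: {uid: set of keywords}
    cites:    {uid: list of cited (older) uids}
    Each path runs from a citing document back to the documents it builds on.
    """
    tagged = {uid for uid, kws in keywords.items() if keyword in kws}
    # A tagged document cited by another tagged one cannot start a maximal path.
    cited_by_tagged = {c for uid in tagged for c in cites.get(uid, []) if c in tagged}

    def extend(path):
        older = [c for c in cites.get(path[-1], []) if c in tagged]
        if not older:
            return [path]
        return [p for c in older for p in extend(path + [c])]

    paths = []
    for start in sorted(tagged - cited_by_tagged):
        paths.extend(extend([start]))
    return paths


keywords = {
    "A": {"percolation", "networks"},
    "B": {"percolation"},
    "C": {"percolation"},
    "D": {"optics"},
}
cites = {"A": ["B", "D"], "B": ["C"]}
print(keyword_paths("percolation", keywords, cites))
```

A search by keyword AND author would be the same walk with a stricter membership test for `tagged`.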

version 0.9.4

Written by Romeo Anghelache

Good news: Hermes development is partially supported until June 2006 by the MPI for Gravitational Physics.

A more radical version of Hermes is in the works; 0.9.4 was the last one developed within the 'deadline' mindset.

My plan is to have a Hermes which generates a clean document model that can be validated, complete with the metadata necessary to store this document in institutional repositories (digital libraries).


blacklisting spam

Written by Romeo Anghelache

There are several methods to block spam. Yet there are some stupid individuals (SpamHaus and the likes of them) who blacklist servers which have sent spam. These stupid individuals cannot imagine that the IP of a spamming server can be shared with an innocent domain, like here. As a result, no matter where I get the webhosting, I'm spam-blacklisted sooner or later, because the real spammers move, I guess. I'm left with sending removal requests to these idiots. This is an aggression.

You have no right to mess with the mail from my domain, you don't have the right to list or blacklist anything related to my domain, IP or anything else. I am buying a (webhosting) service, and you are blocking it because you didn't learn enough in school, losers. You should wake up one morning and find your door blocked in concrete, and that's because you happen to share the floor with a fellow who has done something wrong. You should be outlawed, blacklisters. And the administrators using your blacklists are incompetent. They should lose their jobs.

Are any of you readers free of spam because your admin uses blacklisting? No? Then tell the dinosaur there not to use these blacklists created by self-important losers. They have no effect other than bothering people who have nothing to do with spam. Tell the dinosaur about Bayesian methods and tell him to forget about using spam block lists.

Here's an analysis of list-based spam blocking (Jacob, 2003) and the bad consequences of using it.

Basically, a content-based adaptive method against spam is much better than any method based on features totally external to spam, like IP addresses. The best policy is to use Mozilla Mail, or any other mail client which already implements a Bayesian filter.
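
To show what "content-based and adaptive" means in practice, here is a very small naive Bayes classifier in Python. It is a sketch of the general technique only, not the filter any particular mail client ships; the class name `NaiveBayesFilter`, the tokenizer and the training messages are made up for the example, and it assumes at least one training message per class.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Adapting to new mail is just a matter of updating the counts.
        self.words[label].update(tokens(text))
        self.messages[label] += 1

    def _log_score(self, text, label):
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total = sum(self.words[label].values())
        # log prior + sum of log likelihoods; smoothing keeps unseen words finite
        score = math.log(self.messages[label] / sum(self.messages.values()))
        for word in tokens(text):
            score += math.log((self.words[label][word] + 1) / (total + vocab))
        return score

    def is_spam(self, text):
        return self._log_score(text, "spam") > self._log_score(text, "ham")


mail_filter = NaiveBayesFilter()
mail_filter.train("cheap pills buy now", "spam")
mail_filter.train("win money now cheap offer", "spam")
mail_filter.train("meeting notes attached for the seminar", "ham")
mail_filter.train("draft of the paper attached", "ham")
print(mail_filter.is_spam("buy cheap pills"))
print(mail_filter.is_spam("seminar paper draft"))
```

Nothing in the decision depends on the sender's IP address; only the words of the message and the user's own training matter.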

I don't get spam at all, on my account on Yes, really.

There's a script I use (in JavaScript), not only to hide my e-mail address from spammers but also to feed them a false e-mail address; you can check how it works on my webpage. Note the initial string which fakes an e-mail address (that is what a spammer would get by parsing the page for addresses) and the correct one which forms only when you click/execute the script. The JavaScript hiding is not an original idea, but the fake e-mail address available in the page to feed the spammer is.
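
The idea can be illustrated with a short Python helper that generates such a link. This is not the script used on the webpage, only a guess at the general shape of the trick: the function name `decoy_mailto`, the decoy address and the harvester regular expression are all assumptions, and the inputs are assumed to contain only plain letters, digits and dots.

```python
import re


def decoy_mailto(real_user, real_domain, decoy="nobody@example.invalid"):
    """Build an HTML link that shows only a decoy address to page scrapers.

    The real address is assembled by the onclick handler; the '@' is
    produced with String.fromCharCode(64) so it never appears next to
    the real user name in the page source.
    """
    parts = ",".join(f"'{p}'" for p in real_domain.split("."))
    script = (
        f"this.href='mailto:'+'{real_user}'"
        f"+String.fromCharCode(64)+[{parts}].join('.');"
    )
    return f'<a href="mailto:{decoy}" onclick="{script}">{decoy}</a>'


# What a naive address harvester would extract from the page source.
HARVESTER = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

html = decoy_mailto("author", "example.org")
print(html)
print(set(HARVESTER.findall(html)))
```

A scraper that parses the page for addresses picks up only the decoy, while a human who clicks the link gets the real address once the handler has rewritten `href`.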

Hermes source availability, funding issues

Written by Romeo Anghelache

Although the XML examples are converted with a newer version of Hermes than 0.9.3, the source distribution of this newer version won't be available for download until I get some (reliable promise of) funding explicitly for developing Hermes further (since October 2005). Those who are in need can use alternative tools, like tex4ht, or continue the Hermes development with their own resources. Ironically, while the E.U. is making a lot of fuss about digital libraries, I'm having difficulties finding a related job; my CV is here and my offer is written here.

However, once I get hired as a research programmer in knowledge representation, as a digital content library architect, or in a related area, I will be able to continue releasing Hermes updates in my spare time (even if it won't be part of the job).

Here's the list of updates since version 0.9.3:

  • Formatting, both in the MathML mapping and in the publishing stylesheet, has been refined for all the fonts currently supported.
  • Added font mappings for the pxfonts and txfonts. The full list of currently supported fonts is: cmr, cmu, cmb, cmsl, cmti, cmtex, cmtt, cmtcsc, cmssdc, cmss, cmmi, cmmib, cmsy, cmbsy, cmbx, cmbxti, cmex, msam, msbm, eufm, eurm, eusm, euex, cmcsc, wncyr, wncyi, rsfs, lassdc, pplr8t, pplri8t, pplb8t, zppler7y, zppler7t, zppler7m, zppler7v, zppleb7y, zppleb7m, zppleb7t, lasy, wasy, ecrm, ecb, ecbi, ecbx, ecti, ecsc, ecsl, pxr, pxi, pxb, pxbi, pxbsc, pxbsl, pxsl, pxsc, pxss, pxsssl, pxsssc, pxtt, pxbtt, pxttsl, pxbttsl, pxttsc, pxbttsc, p1xr, p1xi, p1xb, p1xbi, p1xbsl, p1xbsc, p1xsl, p1xsc, p1xss, p1xsssl, p1xsssc, p1xtt, p1xttsl, p1xttsc, pxmi, pxbmi, pxmi1, pxbmi1, pxmia, pxsy, pxsya, pxbsya, pxsyb, pxbsyb, pxsyc, pxbsyc, pxex, pxexa, txr, txi, txb, txbi, txbsc, txbsl, txsl, txsc, txss, txsssl, txsssc, txtt, txbtt, txttsl, txbttsl, txttsc, txbttsc, t1xr, t1xi, t1xb, t1xbi, t1xbsl, t1xbsc, t1xsl, t1xsc, t1xss, t1xsssl, t1xsssc, t1xtt, t1xttsl, t1xttsc, txmi, txbmi, txmi1, txbmi1, txmia, txsy, txsya, txbsya, txsyb, txbsyb, txsyc, txbsyc, txex, txexa.

version 0.9.3

Written by Romeo Anghelache

Hermes version 0.9.3 is online:

  • the library document is made of sections (type envelope, section, subsection, subsubsection, bibliography etc.),
  • it has better structured metadata,
  • the citations/bibliography have gone semantic too (no more on-the-fly id generation),
  • some bugs are fixed (DeclareMathOperator of AMS-LaTeX works fine now),
  • the publishing stylesheet is more elegant and exports the metadata in the typical XHTML fields.

The example collection of converted articles (XML+MathML+Unicode) is available here.

