MULTILINGUAL CONTENT MANAGEMENT

Publication year: 2008

Jacob Palme
Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden.
e-mail:jpalme[AT-SIGN]dsv.su.se


Abstract

Some web sites provide their information in multiple languages. This paper discusses the experience of developing such a web site (http://web4health.info/). Such development is greatly simplified if the software clearly separates language-independent structure from language-dependent information, so that changes in the language-independent structure can be made in one operation for all languages. Work flow support is also important, since different people perform different tasks in the production, such as writing a text and translating it to other languages.

This work was partly financed by the Commission of the European Communities. An earlier, shorter version of this paper was published in the proceedings of the Terena Networking Conference, 2005. The present paper is however improved in several respects compared to that paper.

1 Introduction

1.1 Multilingual web sites

More and more often, organizations need to provide their information in several languages. Many web sites offer visitors a choice of which language to use in viewing the contents of the web site. The management of such a web site raises a number of issues. This paper reports on the experience from the actual development of such a web site.

1.2 Why multiple languages?

It is often taken for granted that English is and always will be the lingua franca of the Internet and perhaps also of the world at large. This is however not the case, writes [17]:

Although English with its slightly less than 400 million native speakers is the second largest language today (Chinese is the first and has three times as many), it is estimated [15] that English will become one of four languages with around 520 million native speakers in 2050 (e.g. 520 million plus/minus 40 million - the other three languages being Hindi/Urdu, Spanish and Arabic). In fact, as English-speakers on a global basis have relatively low birth rates, the global proportion of native English-speakers is expected to shrink from over 8% in 1950 to less than 5% in 2050 [15].
With satellite television, less than half the expected European audiences turned out to be fit for English-language television and less than 3 percent had an excellent command of English in countries such as France, Spain and Italy [16]. We do know from satellite television - an area more mature than the Internet - that viewers want television programming to be in their local language [16]. As Internet access continues to diffuse, there is no reason to believe the situation would be different regarding what language Internet users want their web pages and their e-mail conversations to be in. These questions in fact have to do with the very identity of many millions or even billions of people.
Market research from a few years back that showed that more than 50% of the Internet users speak a native language other than English, that 37 million Americans do not speak English at home (US Department of Health), that web users are up to four times more likely to purchase from a site that communicates in the customer's native language (IDC) and so on [17].

If you are doing Search Engine Optimization (SEO), multilingual content can multiply the number of visitors by multiplying the number of search strings you are optimized for. See more about SEO in chapter 7.

1.3 The Web4Health web site

The web site developed has the name Web4Health, at the address http://web4health.info/. When this is written (July 2008), it contains about a thousand informational texts for laymen in the area of mental health. Most of the content is available in German, English and Swedish, some of it also in Finnish, Greek, Italian and Polish. In recent months the web site had nearly one million visitors who viewed nearly three million pages.

The content of the web site was developed by medical experts in Germany, Greece, Italy, the Netherlands and Sweden. Each medical expert provided texts in their native language and/or English, and also translated the informational pages from other languages (mostly English) to their native language. The software does not require any master language: any text can first be provided in any language and then translated directly to other languages, or indirectly with, for example, English as the intermediate language.

Most of the content is the same in all languages, but each medical partner was free to decide what to include, could modify the text to suit the needs of each language region when translating it to his/her language, and was also free to add pages available only in a specific language.

The services provided by the web site for its visitors are:

  1. Access to the informational pages through a hierarchical structure (taxonomy) of menus. One page can be placed in multiple positions in this structure.
  2. Access to the informational pages using a natural-language question-answering system.
  3. An ask-the-expert area, where visitors can ask a question not covered by the web site and get a personal answer from one of the medical experts.
  4. Forums for discussion of mental health issues.

1.4 Content-management system

To manage the development and translation of the content, a multilingual content management system was developed for this project. Content management systems [11] are software systems specifically designed to handle large sets of documents, such as web sites with many pages [2]. According to [9], there are more than 225 software vendors supplying content management systems, even though this is a very new market, which has only existed for a few years. Our system has special features, described in this paper, which are not available in most other such systems (a system which has some such features is described in [10]). This paper describes the main principles of our content management system. It explains the advantages of the system design, but also discusses drawbacks and how an ideal multilingual content management system should work. It also describes some features which we now understand that we should have implemented, but which are not ready yet.

1.5 Natural-language question-answering system

The natural-language question-answering system (QuickAsk) [8], [3], [4] used in the web site is based on templates. For each answer, one or more templates are developed, which will match many different variations of questions, for which this answer is suitable.

Example 1 of a template:

need* have must help* go went necessar* ; doctor* psychologis* psychother* profession* expert*

How this template works:

Part matched by synonyms of "must", "need", "help" | Part matched by synonyms of "doctor", "psychologist", "expert"
Must I get help from | a doctor to stop smoking?
Do I need | a psychologist?
...help necessary | Is expert...
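The matching principle of Example 1 can be illustrated with a minimal sketch. It assumes a simplified reading of the template syntax: groups are separated by ";", a trailing "*" matches any word ending, and a question matches when every group contains at least one matching word, in any order. The bracket expressions and "$" synonym references of Example 2 are not handled, and the function names are hypothetical, not those of QuickAsk.

```python
import re


def parse_template(template):
    """Split a template into groups of alternative word patterns."""
    return [group.split() for group in template.split(";")]


def word_matches(pattern, word):
    """A trailing '*' matches any word ending; otherwise require equality."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


def template_matches(template, question):
    """True if every group of the template matches some word of the question."""
    words = re.findall(r"[a-z]+", question.lower())
    return all(
        any(word_matches(pattern, word) for pattern in group for word in words)
        for group in parse_template(template)
    )


TEMPLATE = ("need* have must help* go went necessar* ; "
            "doctor* psychologis* psychother* profession* expert*")

print(template_matches(TEMPLATE, "Must I get help from a doctor to stop smoking?"))  # True
print(template_matches(TEMPLATE, "Is expert help necessary?"))                       # True
print(template_matches(TEMPLATE, "What is anorexia?"))                               # False
```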

Example 2 of a template:

$eat $food ; sensibl* rational* levelhead* reasonab* unreasonb* prudent* intelli* sane* insane* unrealist* realist* thoughtful* credib* understand* know* clearhead* bright* perspica* precept* astut* smart* apt suitab* witt* shrewd* $good together party* partie* band* company* bunch* group* gathering* alone lone* solitar* gregarious* secluded* single* desolate* separat* friend* accompan* unaccompan* [on ; $people ; own] [ with by at in # $people ]

This template will match for example the following questions:

This system requires that one or more such templates are constructed for each answer. During usage, the questions asked by actual users are logged, and these log files are used to check whether the system produces suitable answers. When it does not, either the templates need to be revised or a new informational text needs to be written. The total time spent on producing a good template for each answer is 15-60 minutes, including time spent on testing the templates and on revising them based on usage logs.

Step-wise refinement of the templates, by investigating logs of actual user questions and adjusting the templates to new variants of the questions, is very important. A problem is that this work, done for one language, does not carry over to other languages. Tools to overcome this problem are discussed in chapter 5.

An alternative would be to use traditional so-called free-text search tools, which automatically match questions to words in the answers. The advantage of the system we used is that it more often finds the best answer to a question, and that the search response contains fewer unsuitable answers (in information retrieval terminology, our system gives higher recall and precision than free-text search tools). The advantage of free-text search tools, of course, is that the manual work of producing the templates is not needed. However, good-quality free-text systems usually include manually added keywords for each answer to improve the quality of the results. Managers of free-text systems also often scan log files of less successful user queries, and modify the content so that a better answer will be provided the next time the same question is asked. Thus, to get very good search results, such systems may require as much manual work as QuickAsk.
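For readers unfamiliar with the two terms: precision is the share of the returned answers that are suitable, and recall is the share of the suitable answers that are returned. The following small sketch computes both for one query. The page identifiers are invented for illustration and are not taken from the Web4Health logs.

```python
def precision_recall(returned, relevant):
    """Compute precision and recall for one query from two collections of page ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four answers returned, three answers judged suitable.
returned = ["anorexia-symptoms", "anorexia-causes", "smoking-help", "sleep-tips"]
relevant = ["anorexia-symptoms", "anorexia-causes", "anorexia-treatment"]

p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```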

Two master's students at DSV [12] have compared QuickAsk with Google, restricted to the site with "site:web4health.info/sv/". The test was done with 50 randomly selected actual questions from the log files of questions asked to the system. They found that the natural-language question-answering system found a good answer to 90 % of all questions, compared to only 68 % for Google. Traditional so-called free-text search systems were usually worse than Google, except for SiteSeeker, which was better than Google (78 %) but still not nearly as good as the natural-language question-answering system QuickAsk. SiteSeeker achieves better results than Google by handling misspellings and Swedish-language conjugations better, and also because manually added keywords based on logs of unsuccessful queries are used to improve the result quality.

2 The KOM2002 content management system

This chapter describes how the KOM2002 content management system (CMS) works at present. Note that this is not ideal; chapter 3 discusses how an ideal system should work.

KOM2002 handles objects, which can be split into fields. Each object can exist in more than one language. Each field can be marked as being identical in all languages or as having possibly different text in different languages. In the latter case, the field can also be marked to have Google machine translation [7] provide an initial text as an aid to the human translator (all pages are translated by humans, but machine translation can be used as input to the human translator). Other markings specify, for example, whether a field is mandatory.
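The field markings described above can be pictured with the following sketch. It is not the KOM2002 data model; the class names, attribute names and the sample page are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class FieldDef:
    """Markings of one field (hypothetical names, modelled on the description above)."""
    name: str
    language_dependent: bool = True   # False: identical in all languages
    machine_translate: bool = False   # offer a machine translation as a first draft
    mandatory: bool = False


@dataclass
class CmsObject:
    shared: dict = field(default_factory=dict)       # field name -> value for all languages
    by_language: dict = field(default_factory=dict)  # language -> {field name -> text}

    def get(self, fdef, language):
        """Return the value of a field as seen in the given language."""
        if not fdef.language_dependent:
            return self.shared.get(fdef.name)
        return self.by_language.get(language, {}).get(fdef.name)

    def missing_mandatory(self, fdefs, language):
        """List mandatory fields that have no value in the given language."""
        return [f.name for f in fdefs if f.mandatory and not self.get(f, language)]


FIELDS = [
    FieldDef("identifier", language_dependent=False, mandatory=True),
    FieldDef("title", mandatory=True),
    FieldDef("body", machine_translate=True, mandatory=True),
]

page = CmsObject(
    shared={"identifier": "anorexia-symptoms"},
    by_language={
        "en": {"title": "Anorexia Symptoms", "body": "Text of the answer"},
        "de": {"title": "Symptome der Anorexie"},
    },
)

print(page.missing_mandatory(FIELDS, "en"))  # []
print(page.missing_mandatory(FIELDS, "de"))  # ['body']
```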

Objects can be linked to each other. For example, all pages are linked to an area where page texts are stored, while other texts, like synonym lists and stop lists for the natural-language answering engine, are linked to an area for export objects other than texts. Comments and discussion of a page, internally between the developers, can be stored in a separate discussion area associated with each page being developed. More complex link structures can be defined for handling of, for example, work flow applications.

When an object is to be translated, the translator locates the source version, and chooses the target language. A window is then opened, which shows the text in both languages side by side. Some of the fields are automatically filled with Google translations (See Figure 1). The translator can then supply the text in the target language, and, when ready, submit the translation.

Figure 1: Part of the window when doing a translation:

Translate from: German | Translate to: English
Language: | Language:
Titel: | Title:
No ads: | No ads:
Frage(n): | Question(s):
Bezeichnung: | Identifier:
Reference name: | Reference name:
Geschrieben am: | Date-created:
Zuletzt aktualisiert: | Date-last-modified:
Text: | Body:

The translation window can also be used when modifying a translation. For example, if a change has been made in the German original, and an English translation already exists, the translation window can be used to see the new German text side by side with a window where the change can be copied to the English translation. This feature is especially valuable for changes in the natural-language classification, since the classification is often changed to take account of experience from the system logs of how well the natural-language question-answering system works.

The content-management system also has commands to export pages, when ready, to the static pages available to external visitors, and to information in the data base used by the natural-language question-answering system. The same page is usually exported to multiple exported pages, such as a page for screen viewing, a page for printing, a page with a list of sources and a page as stored in the data base of the natural-language question-answering system. Export to a variant of a page for viewing from mobile devices can easily be added, but this has not yet been done at the time of writing.

The content management system also has a forum and chat facility which can be used by both developers and external visitors.

The system has a compare facility, which shows the changes between two versions of the same page in a specific language, and a facility to checkmark an object, while one of the editors is working on it, to prevent more than one editor from modifying the same object at the same time.
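As an illustration of these two facilities, the sketch below uses the Python standard library module difflib to show the changes between two versions of a text, and a small registry class to let only one editor at a time hold an object. This is only a sketch of the idea; the class and function names are hypothetical, and KOM2002 may implement both facilities quite differently.

```python
import difflib


def compare_versions(old_text, new_text):
    """Return the changes between two versions of a page as unified-diff lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))


class CheckoutRegistry:
    """Remembers which editor has checkmarked which object."""

    def __init__(self):
        self._holders = {}

    def check_out(self, object_id, editor):
        """Checkmark the object; return False if another editor already holds it."""
        holder = self._holders.setdefault(object_id, editor)
        return holder == editor

    def release(self, object_id, editor):
        """Remove the checkmark, but only if this editor holds it."""
        if self._holders.get(object_id) == editor:
            del self._holders[object_id]


for line in compare_versions("First paragraph.\nOld wording.",
                             "First paragraph.\nNew wording."):
    print(line)

registry = CheckoutRegistry()
print(registry.check_out("page-17", "editor-a"))  # True
print(registry.check_out("page-17", "editor-b"))  # False
```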

In the rest of this paper, both functions existing in KOM2002 and functions not yet implemented will be described.

3 Multilingual information

Examples of information which needs to be translated to multiple languages:

A. Attributes of informational pages:

  1. Main text body, including links within it to other web pages on the site or elsewhere.
  2. Titles.
  3. Questions.
  4. Answers.
  5. Source references.
  6. Author name.
  7. Meta-description (for search engines, some of which want a short summary of each page).
  8. Meta-keywords (used by some search engines).
  9. Classification for the natural-language question-answering system.

B. Other texts:

  1. Synonym lists used by the natural-language question-answering system.
  2. Stop lists used by the natural-language question-answering system.
  3. Export templates. The data base information for each page is inserted in such a template to produce the page shown to external users.
  4. User interface pages and phrases, including the home page for each language.

4 Experience with multilingual content management

Traditional content-management systems combine the data base entry for a page with a template to produce a page to be shown to visitors. Figure 2 below shows that multilingual content needs two stages of templates: a language-independent meta-template is combined with text in each language to produce a template for each language, and these templates are then used in a second step to produce web pages. Note that some images are language-independent and should then be in the language-independent meta-template. Other images contain text, and should then be included in the text which is used to produce the language-dependent template for each language.

Figure 2: The top part shows how a single-language CMS uses language-dependent templates to produce web pages. The bottom part shows that a multilingual CMS also needs language-independent templates, which are combined with template text in each language to produce specific templates for each language:

diagram of templates
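The two template stages can be sketched with string.Template from the Python standard library. In the meta-template, "${...}" marks template text which is filled in per language in the first stage, while "$$..." is an escaped placeholder which survives the first stage and is filled with page data in the second stage. The template content and the field names are assumptions for illustration; the labels "Question(s)" and "Frage(n)" are taken from Figure 1.

```python
from string import Template

# Language-independent meta-template (hypothetical layout).
META_TEMPLATE = Template(
    "<h1>$$title</h1>\n"
    "<h2>${questions_label}</h2>\n"
    "<p>$$questions</p>"
)

# Template text in each language.
TEMPLATE_TEXT = {
    "en": {"questions_label": "Question(s)"},
    "de": {"questions_label": "Frage(n)"},
}


def language_template(language):
    """Stage 1: combine the meta-template with the template text of one language."""
    return Template(META_TEMPLATE.substitute(TEMPLATE_TEXT[language]))


def render_page(language, page_fields):
    """Stage 2: combine the language-specific template with the data of one page."""
    return language_template(language).substitute(page_fields)


print(render_page("de", {"title": "Titel der Seite", "questions": "Text der Frage"}))
```

A change in the layout of the meta-template then only has to be made once, and stage 1 regenerates the templates of all languages.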

Each web site has its own data structures of objects linked in different ways. At the leaf end of these data structures are often texts which have to be translated for a multilingual web site. Some editing operations will only change the structure, and not the text leaves. A good multilingual system should allow editors to make such operations only once, and have immediate effect in all languages, by separating language-dependent texts from language-independent structure.

Figure 3: Separation of language-independent structure from language-dependent texts:

Figure 3 shows an example of a hierarchical structure, where the structure is language-independent and only the names in the leaves need to be translated. But the same principle applies to all kinds of structures. If, for example, a web page contains a section with links to related pages within the web site, then only the user-visible strings need to be translated, the structural linking-information need not be translated.
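A minimal sketch of this separation is given below, assuming that the structure is stored as language-independent node identifiers and that the visible names are stored separately for each language. The identifiers and the German names are invented for illustration. Moving a node is one structural operation, and it takes effect in every language.

```python
# Language-independent structure: parent id -> ordered list of child ids.
STRUCTURE = {
    "top": ["anxiety"],
    "anxiety": ["panic", "anxiety-causes"],
    "panic": [],
    "anxiety-causes": [],
}

# Language-dependent texts: language -> {node id -> visible name}.
LABELS = {
    "en": {"top": "Home", "anxiety": "Anxiety", "panic": "Panic, Frightened",
           "anxiety-causes": "Causes of Anxiety Disorders"},
    "de": {"top": "Startseite", "anxiety": "Angst", "panic": "Panik",
           "anxiety-causes": "Ursachen von Angststörungen"},
}


def render(node, language, depth=0):
    """Return the subtree below node as indented lines in the given language."""
    lines = ["  " * depth + LABELS[language].get(node, node)]
    for child in STRUCTURE.get(node, []):
        lines.extend(render(child, language, depth + 1))
    return lines


def move(node, new_parent):
    """One structural edit: detach node from its parent and attach it elsewhere."""
    for children in STRUCTURE.values():
        if node in children:
            children.remove(node)
    STRUCTURE[new_parent].append(node)


move("panic", "top")  # done once, no text is touched
print("\n".join(render("top", "en")))
print("\n".join(render("top", "de")))
```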

A good web site should on each page include links to other related pages. This is important, since visitors to web sites very often click on links (see Figure 10 below). Figure 4 shows an example of a web page with links to related pages at the bottom. In addition to the links to related pages, there are also two buttons, "Find a few related answers" and "Find many related answers", which give lists of even more related pages. These lists are actually produced by the QuickAsk natural-language question-answering system, using the title of the page as query string.

Figure 4: Example of page with links to related pages below the main text:

Symptoms/Signs of Anorexia Nervosa ; Anorexia Symptoms; What is Anorexia

Intelligent natural language question-answering in the area of psychology and psychiatry. Ask a simple question  Local help Info

Top | Discuss this | Get personal advice | Print
Question(s):
Written by: Gunborg Palme, certified psychologist and certified psychotherapist, teacher and tutor in psychotherapy.
First version: 26 Nov 2006. Latest revision: 16 May 2008.

Describe the symptoms of eating disorders like Anorexia Nervosa. What are the main signs of anorexia nervosa?

Answer:

... omitted text ...

It can also depend on an addictive condition where starvation stimulates the body's reward centre. More.

... omitted text ...

These links from one page to another page should be the same for all languages, with only the visible text changed from language to language. A link which in the English version leads to the English version of a page should of course in the Swedish version lead to the Swedish version of the same page, but this can easily be achieved with relative URLs, so that the actual URLs used can be the same for all languages. This can be implemented either so that a change in the list in one language automatically changes the list in all other languages, or so that a special command is available to export a changed link list to other languages.
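The effect of relative URLs can be shown with urljoin from the Python standard library, which resolves a relative link against the URL of the page it appears on. The sketch assumes that each language has its own directory with identical file names below it; the host name and file names are hypothetical.

```python
from urllib.parse import urljoin

# Stored once, identical for all languages (hypothetical file name).
RELATED_LINK = "anorexia-causes.htm"


def resolve(page_url, link):
    """Resolve a relative link against the URL of the page which contains it."""
    return urljoin(page_url, link)


print(resolve("http://example.org/en/answers/anorexia-symptoms.htm", RELATED_LINK))
# http://example.org/en/answers/anorexia-causes.htm
print(resolve("http://example.org/sv/answers/anorexia-symptoms.htm", RELATED_LINK))
# http://example.org/sv/answers/anorexia-causes.htm
```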

Note that this only applies to links within the web site. Links to other web sites usually link to pages only available in one language, so such links should be different for each language and usually not be exported to other language versions of the same page.

The same principle applies to links within the text, like the underscored grey word "More" in Figure 4 above.

Most web sites also have a category structure, containing menus of links to other pages. This structure makes it possible, starting at the home page, to find all pages on the site by just clicking on links. This is important since, as shown in Figure 10 below, users prefer clicking on links to using site-internal search engines. In Web4Health, we have two formats for such menus or category index pages. One format is used for main menus containing many links, as shown in Figure 5 below:

Figure 5 - Example of format for a menu containing a large number of links:

Anxiety
Panic, Frightened
Diagnosis of Anxiety Disorders
Causes of Anxiety Disorders

Another format is used for smaller menus containing fewer links, shown in Figure 6 below:

Figure 6 - Example of a menu containing a small number of links:

Suicide Help, Suicide Facts, Suicide Prevention Awareness

Intelligent natural language question-answering in the area of psychology and psychiatry. Ask a simple question  Local help Info

Top | Discuss this | Get personal advice | Print

If you are considering suicide, always first talk to someone with whom you can discuss your problems. Many countries have special phone numbers you can call to discuss your situation when considering suicide.

Web4Health does not give any advice on suicide, but below are some informational articles.

The structure of these menus can be the same in all languages, and a good multilingual CMS can then store this structure, so that a change in such a menu in one language can either automatically be exported to other languages, or exported with a special command.

Note that no translation is necessary for menu items, since the texts can be taken from the objects the menu refers to. It should, however, be possible to modify this text, for example to have a shorter text in the menu than the title of the page it refers to. In Web4Health, we have chosen to store in the data base for each page a separate title and text to be used in menus. But even so, it should be possible to modify the menu text separately for each menu. For example, in a part of a large menu like this

the title "Anxiety" need not be repeated in each menu item, but may be included in the full title of the page the menu item refers to.

One should note that the category index pages need not be hierarchical. Some CMS use the hierarchical directory system of a computer as a basis for the category indexing. This is not good, since it allows only one path to each page. Since different users think in different ways, allowing more than one path to the same page makes it easier for users to find information (see Figure 7 below).

Figure 7 - This example shows that the category index need not be a hierarchical structure. The page "Anxiety suppression" can be reached in more than one way from the top menu:

Diagram showing that from the top menu, an item about 'anxiety suppression' can be reached either via a path through 'eating-disorders-causes' or a path through 'addiction-alcohol' or via a path 'anxiety'.
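The structure in Figure 7 is a directed graph rather than a tree. The sketch below stores the menu links as such a graph and lists all paths from the top menu to one page. For brevity it assumes that each of the three menus named in the figure links directly to the page; in the real site the paths may pass through further menus.

```python
# Menu id -> ids of the menus or pages it links to (simplified from Figure 7).
MENU_LINKS = {
    "top": ["eating-disorders-causes", "addiction-alcohol", "anxiety"],
    "eating-disorders-causes": ["anxiety-suppression"],
    "addiction-alcohol": ["anxiety-suppression"],
    "anxiety": ["anxiety-suppression"],
    "anxiety-suppression": [],
}


def paths_to(target, node="top", trail=()):
    """Return every click path from node to target as a tuple of ids."""
    trail = trail + (node,)
    if node == target:
        return [trail]
    found = []
    for child in MENU_LINKS.get(node, []):
        if child not in trail:  # guard against circular menu links
            found.extend(paths_to(target, child, trail))
    return found


for path in paths_to("anxiety-suppression"):
    print(" > ".join(path))
```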

A web page itself is also structured into paragraphs, numbered lists, and other HTML objects. This structure is usually the same for each language, as shown in Figure 8 below.

Figure 8 - This figure shows the hierarchical structure of an HTML document, where the structure usually is the same in all languages, and only the texts need to be translated:

Diagram showing the hierarchical structure of a HTML document, with head-body-ol-li-text item.

When one paragraph or list item is changed in one language, a work flow task to translate this change to other languages is produced. This work flow task should indicate which paragraph has been changed, and in which way, to help the translator move the change to other languages. This can be accomplished by storing each HTML object as an object in an HTML-structured representation of the page. This will also help translators, because they need not see the HTML encodings, only the texts of the objects to be translated.
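How such work flow tasks could be derived is sketched below. The sketch assumes that each paragraph or list item of a page is stored as a separate text under a stable identifier; the identifiers, the task format and the function names are hypothetical and do not describe an existing KOM2002 function.

```python
def changed_blocks(old_blocks, new_blocks):
    """Return the ids of blocks whose text was added or changed."""
    return [block_id for block_id, text in new_blocks.items()
            if old_blocks.get(block_id) != text]


def translation_tasks(page_id, old_blocks, new_blocks, source_language, target_languages):
    """Create one task per changed block and target language."""
    return [
        {"page": page_id, "block": block_id,
         "from": source_language, "to": language,
         "old": old_blocks.get(block_id), "new": new_blocks[block_id]}
        for block_id in changed_blocks(old_blocks, new_blocks)
        for language in target_languages
    ]


# Hypothetical page: only the second list item was changed in German.
old = {"body/p[1]": "Einleitung", "body/ol/li[1]": "Erster Punkt", "body/ol/li[2]": "Zweiter Punkt"}
new = {"body/p[1]": "Einleitung", "body/ol/li[1]": "Erster Punkt", "body/ol/li[2]": "Zweiter Punkt, geändert"}

for task in translation_tasks("page-17", old, new, "de", ["en", "sv"]):
    print(task["to"], task["block"], "->", task["new"])
```

Each task carries both the old and the new source text, so that the translator can see exactly what was changed.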

Structured storage, where structure is independent from text, is especially important when step-wise improvements are done to the content of a web site. A change of the structure should then automatically be available in all languages, not only in the language in which the change is made. This is very important in order to retain high quality in a site where step-wise improvement is done.

What is described above can either be done such that the structure (menus, paragraphs, etc.) is identical for all languages, or it may allow some such items to be different for different languages. Examples of items which may be different for each language are lists of links to other web sites. A good multilingual CMS may then be designed so that the normal case is to have identical structure in all languages, but allow certain structural items to be particular for only one language.

Other editing operations require the creation of one or more new texts. A good multilingual system should allow this with one single operation to create the object in one initial language, plus added operations to translate the new texts to each target language. Our experts usually either write their original texts in their native language and then translate them to English, or write the original texts directly in English. Other experts can then translate them, usually from English to their native language.

When visitors view the web site in their native language, the system can either be designed so that they will only see texts available in their native language, or so that texts which have not yet been translated are shown in another language, usually English. We have chosen the second option: whenever a text is not available in the native language, the English version is shown instead. A third alternative might be to let Google or some other machine translation engine translate the texts from English, possibly as a temporary measure until a human has made a better translation. We have chosen not to do this, since some people are offended by machine translations. This seems to be a matter of personal taste: some people think such translations are quite useful, even if not always perfect, while others cannot accept imperfect language at all.
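The option we have chosen amounts to a simple fallback rule, sketched below. The function name and the data layout (a mapping from language code to text for one page) are assumptions for illustration.

```python
def text_for_visitor(translations, language, fallback="en"):
    """Return (language shown, text), preferring the visitor's own language."""
    if language in translations:
        return language, translations[language]
    if fallback in translations:
        return fallback, translations[fallback]
    return None, None  # the page exists in neither language


page = {"en": "English text", "de": "Deutscher Text"}
print(text_for_visitor(page, "de"))  # ('de', 'Deutscher Text')
print(text_for_visitor(page, "sv"))  # ('en', 'English text')
```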

Often, the task of creation of structure is done by other people than those who translate the texts to different languages. Some work flow functionality is then useful. The most important work flow functionality is a tool, by which a translator can find which new texts need translation. Some systems use English as a master language from which all translations are produced, other systems allow any language to be used as a source for a translation to any other language.
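A tool for finding texts which need translation could work as in the following sketch. It assumes that the date of the last modification is stored for every language version of an object, and it proposes the most recently modified other version as source, so that no master language is required. The object identifiers and dates are invented.

```python
from datetime import date


def needs_translation(objects, target):
    """List (object id, suggested source language, reason) for one target language."""
    todo = []
    for object_id, versions in objects.items():
        others = {lang: d for lang, d in versions.items() if lang != target}
        if not others:
            continue  # the object only exists in the target language
        source, newest = max(others.items(), key=lambda item: item[1])
        if target not in versions:
            todo.append((object_id, source, "missing"))
        elif versions[target] < newest:
            todo.append((object_id, source, "outdated"))
    return todo


# Hypothetical data: object id -> {language -> date of last modification}.
objects = {
    "anorexia-symptoms": {"en": date(2008, 5, 16), "sv": date(2006, 11, 26)},
    "panic-help": {"de": date(2008, 1, 10)},
    "sleep-tips": {"en": date(2007, 1, 5), "sv": date(2007, 6, 1)},
}

print(needs_translation(objects, "sv"))
```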

We defined our system with the goal of having the choice of language at the outermost structure of the data structures, as described above. The system mostly adheres to this principle. However, the system does not yet have built-in support for language-independent handling of the hierarchical subject trees (taxonomies). This means that creation and translation of such structures is not at present as easy as it should be.

As much information as possible should be specified in only one language. Thus names of objects which are only visible to the developers are always in English. Only the texts visible to users need to be available in multiple languages. This makes translation easier. For example, the synonym lists used by the natural-language question-answering system have all the names for the synonyms in English, only the values need vary between languages (see Table 1).

Table 1 - Part of the synonym list. Note that the names of the synonyms, visible only to the developers, are in English for all the languages:

Synonym name | Value in English | Value in German | Values in other languages
$adhd | [ad ; hd*] adhd* ahdh* adhs* [a ; d ; h ; d] [d ; a ; m ; p] [a ; d ; d] [attention; deficit; hyperactiv*; disord*] hyperactiv* hypercinet* addh* damp* adhs* twitch* | [aufmerksamkeits ; defizit ; syndrom] [ad ; hs] [a ; d ; h ; s] ads adhs adhd hks hyperaktivitaet* zappel* hampeln hyperkin* unaufmerksam* ablenkbarkeit* impulsivit* verhaltensstö* entwicklungsstör* unkonzen* abgelen* tagträum* träumer* chaosprinz* zerstreut* | ...
$advantage | advantag* pro pros prefer* benefi* asset* gain* favor* favour* positiv* good* triumph* succe* excel* | bevorzugung dienlich*, einträglich*, ergiebig* ertrag frucht gewinn interes* nutz* oberhand oberwasser plus, vorzu* vorteil* guenstig* posit* gut* besser erfolg* | ...
$anorexia | ana anore* anero* anere* anerx* anorx* starv* [no not ;hung*] undernourish* fast fasting fasted tiny little petite weigh* apath* slinky lean meager gaunt lanky skinny famine famish* drought unfed meager* thin gracil* svelt* willow* thin slender* [low : weight] slender* [not ; want* wish* desir* like* : to ; $eat] [refus* declin* reject* rebuf* : to ; $eat] underweig* | anorexie* magersucht* magersuecht* magersuecht* duerr* kachex* kachekt* hager schmal ausgehunger* ausgemergel* unterernaehr* untergewi* abgezehrt spindelduerr abgemage* abgez* arid dünn dürr gertenschlank hager knochendürr knochig, kümmerlich rappeldürr schlank, schlankwüchsig schmächtig spindeldürr hohlwangig | ...
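The use of such a synonym list can be sketched as follows: a template refers to a synonym by its English name, and the name is replaced by the value for the language in question. The values are short excerpts from Table 1, the word "diet*" in the sample template is invented, and the expansion function is an assumption about how the lookup could work, not the QuickAsk implementation.

```python
# Synonym name (always English) -> language -> value (excerpts from Table 1).
SYNONYMS = {
    "$advantage": {
        "en": "advantag* pro pros prefer* benefi*",
        "de": "bevorzugung vorteil* nutz* gewinn",
    },
}


def expand(template, language):
    """Replace every known synonym name in a template by its value in the language."""
    return " ".join(
        SYNONYMS.get(token, {}).get(language, token)
        for token in template.split()
    )


template = "$advantage ; diet*"
print(expand(template, "en"))  # advantag* pro pros prefer* benefi* ; diet*
print(expand(template, "de"))  # bevorzugung vorteil* nutz* gewinn ; diet*
```

Names which are not found in the list, or which have no value in the language asked for, are left unchanged.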

5 Cross-lingual natural-language question-answering

As described above, the natural-language question-answering method we have used means that we have to produce question-matching templates for each page. These templates also often need to be updated, based on entries in the usage logs where the system did not provide the best answer to a certain question. The work of developing and managing these templates requires a special competence. Not even an ordinary professional translator can do it without a few days of instruction on how to create such templates.

Because of this, it is an advantage if only some of the people need to have this particular competence. Also, it is very important that a change in these templates can be done in one language, and the result be immediately available for natural-language question-answering also in other languages.

We have implemented this, using a technique called cross-lingual natural-language question-answering [1], [5]. How this works is shown in Figure 9. The figure uses Italian, but Italian can be replaced by any other language, for which a machine-translator to English is available. If no machine-translator is available, word-for-word dictionary look up may also give acceptable results.

Figure 9 - Cross-Lingual Natural-Language Question-Answering:

Diagram showing that a question in Italian is both put to the Italian-language question-answerer, and, after translation by Google to English, to the English-language question-answerer, and that the matches so found are merged and shown to the person asking the question.

Incoming questions are translated by Google machine translation to English. The English question is then put to the English-language answering engine. When the results have been found, the corresponding native language objects are shown. This could be implemented so that the user never sees that any other language than his own is involved. We have chosen, however, to show the English answer if the text of the answer has not yet been translated to the user's language. This means that users will sometimes see some English answers after their native language answers.
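The flow in Figure 9 can be summarized in the following sketch. The two answering engines and the machine translator are passed in as functions, since their real interfaces are not described here; all names, and the stand-in functions in the usage example, are hypothetical.

```python
def cross_lingual_answers(question, language, native_engine, english_engine,
                          translate_to_english, translated_pages):
    """Merge native-language matches with matches found via the English engine.

    Returns (page id, language in which the page is shown) pairs, with pages
    shown in the user's own language first.
    """
    answers, seen = [], set()
    for page_id in native_engine(question):
        if page_id not in seen:
            seen.add(page_id)
            answers.append((page_id, language))
    english_question = translate_to_english(question, language)
    for page_id in english_engine(english_question):
        if page_id in seen:
            continue
        seen.add(page_id)
        # Show the translation if it exists, otherwise the English page.
        has_translation = page_id in translated_pages.get(language, set())
        answers.append((page_id, language if has_translation else "en"))
    answers.sort(key=lambda answer: answer[1] != language)  # stable sort
    return answers


# Usage with stand-in functions in place of the real engines and translator.
result = cross_lingual_answers(
    "question in Italian", "it",
    native_engine=lambda q: ["local-page"],
    english_engine=lambda q: ["page-1", "page-2", "local-page"],
    translate_to_english=lambda q, lang: "question in English",
    translated_pages={"it": {"page-2", "local-page"}},
)
print(result)  # [('local-page', 'it'), ('page-2', 'it'), ('page-1', 'en')]
```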

It is also possible to set up this process without having any translated answers. This will allow users to ask questions in their native language, but get the answer in English. Since many people handle English better as a passive than as an active language, this would be a useful tool for them.

We also have some texts which are only available in the native language, since each national editor can add texts which are only available in his/her own language. For these texts, a native-language question-answering system is used to find answers.

We have compared the quality of the answers found in this way to question-answering directly in the language of the questions [6]. These comparisons indicate that taking the Google machine-translation engine as is, the quality will be somewhat inferior to that of direct language answering. However, if the dictionary used by Google is extended with the terminology suitable for our subject area, the quality will be almost as good as with direct language answering.

The reason for this is that the standard Google dictionaries are designed for office documents, not health. For example, Google translates the word "body" as if it meant "main part", which is the most common use of this word in office documents, but which, of course, is usually not suitable when talking about health.

One might argue that augmenting the dictionary with new terminology is as much work as writing the classification separately in each language. However, this is not true, because the same dictionary entry can be used in the classification of many answers. For example, the dictionary entry for "cause" can be used in many pages discussing causes of various disorders. Another important advantage with cross-lingual question-answering is that development of the dictionary does not need the special competence needed for doing the classification. Thus, cross-lingual question-answering allows a separation of tasks between people with different competences.
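The effect of a domain dictionary can be illustrated with the word-for-word lookup mentioned earlier. The sketch below is an assumption-laden toy: the dictionaries, their Italian entries and the function name translate_word_for_word are invented for illustration, and a real machine translator would be extended through whatever mechanism it offers rather than through this code. The point shown is only that one domain entry, given priority over the general-purpose dictionary, is reused by every question containing that word.

```python
from collections import ChainMap

# Hypothetical toy dictionaries (Italian -> English); entries are illustrative only.
GENERAL = {"corpo": "main part", "causa": "cause", "del": "of the", "dolore": "pain"}
HEALTH = {"corpo": "body"}  # domain entry overrides the general-purpose sense


def translate_word_for_word(question, *dictionaries):
    """Look up each word in the dictionaries in priority order; keep unknown words."""
    lookup = ChainMap(*dictionaries)  # earlier dictionaries win over later ones
    return " ".join(lookup.get(word, word) for word in question.lower().split())


plain = translate_word_for_word("dolore del corpo", GENERAL)
tuned = translate_word_for_word("dolore del corpo", HEALTH, GENERAL)
print(plain)
print(tuned)
```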

6 Work flow and news control

In a multilingual web site, different people perform different tasks. In our case, medical experts write the texts. Each text is also checked by a medical expert other than the original author. The texts are translated either by other medical experts, or by other translators, in which case the translation is checked by a medical expert. The classification and most of the structuring is done by linguistic or computer-science experts. The same text often has to pass through many hands before being finally published in each target language. It is therefore important to have so-called work flow support.

Our system has a work-flow system, which can be configured for different work flows. The most complex work-flow presently supported has the following states:

  1. The Swedish medical expert writes a Swedish answer to an English question.
  2. A translator translates the Swedish answer to English.
  3. The Swedish medical expert checks the translation for correctness.
  4. The translation is checked by an English-language medical expert.
  5. The translation is published.

The work flow system should make it easy to see the state of an answer, and to get a list of all answers which are in a special state. It should also give content developers easy feedback on when some work should be done, and reminders when this has not been done.
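A work flow of this kind can be sketched as a small state machine. The sketch below is not the Web4Health implementation: the state names, the WorkItem class, the helper functions and the 14-day reminder threshold are all assumptions chosen for illustration. Each state stands for "the corresponding step in the list above has been completed".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class State(Enum):
    WRITTEN = 1            # step 1 done: the answer has been written
    TRANSLATED = 2         # step 2 done: the translator has produced the translation
    CHECKED_BY_AUTHOR = 3  # step 3 done: the original expert has checked it
    CHECKED_BY_EXPERT = 4  # step 4 done: the target-language expert has checked it
    PUBLISHED = 5          # step 5 done


@dataclass
class WorkItem:
    answer_id: str
    state: State
    last_change: date

    def advance(self, today):
        """Move to the next state in the chain; published items stay published."""
        if self.state is not State.PUBLISHED:
            self.state = State(self.state.value + 1)
            self.last_change = today


def in_state(items, state):
    """List the answers which are currently in a given state."""
    return [i.answer_id for i in items if i.state is state]


def overdue(items, today, max_age=timedelta(days=14)):
    """Unpublished items that have not moved for longer than `max_age`."""
    return [i.answer_id for i in items
            if i.state is not State.PUBLISHED and today - i.last_change > max_age]


items = [
    WorkItem("a1", State.WRITTEN, date(2008, 1, 1)),
    WorkItem("a2", State.TRANSLATED, date(2008, 1, 20)),
]
today = date(2008, 1, 25)

first_overdue = overdue(items, today)  # candidates for a reminder
items[0].advance(today)                # the translator finishes "a1"
print(first_overdue, in_state(items, State.TRANSLATED), overdue(items, today))
```

Keeping the date of the last state change on each item is what makes both the "list all answers in a state" view and the reminders cheap to compute.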

The content management system also has tools to aid the transformation of an answer to a personal user question, written by a medical expert, into a general answering page which other users can find when searching the data base. This transformation involves creating templates for the natural-language question-answering system and other similar activities. Long questions are often abbreviated at this stage. The system marks which such user questions have been converted, and which informational page is a conversion of which answer to a specific user question.

Work flow notes the stage of each piece of information, what further action is needed, and notifies the appropriate person who is to perform a certain act. Knowing what each person is expected to do, and reminding them of tasks left to do, is known under the term "news control" and has some similarities to the capabilities of mail programs of knowing which messages a person has not yet read.

7 Attracting visitors

Figure 10 - How visitors move into, out of and inside Web4Health:

Diagram showing that of entry pages, 72 % come from search engines, and 28 % from other causes, and for internal movements between pages, 61 % go from one internal page to another, 12 % go to QuickAsk and 27 % leave the web site.

Figure 10 shows that most visitors to Web4Health come from search engines like Google. Inside Web4Health, most users move to other pages by clicking on a link; only 12 % of the visitors use the QuickAsk natural-language question-answering system.

In total, 25 % of all page views in Web4Health occur because someone clicked on a link in a Google search result, and only 12 % of all page views occur because someone clicked on a link in a search result from the QuickAsk internal question-answering system! Probably people are accustomed to bad internal search engines on most web sites and therefore do not use the one we have.

An important conclusion from this is that it is twice as important to help Google users find pages on our web site, as to help users find pages using our internal natural-language question-answering system. And since 61 % click on links in the web site, having a good set of links between pages is also more important than optimizing the QuickAsk system!

To help Google users find pages, we have been using two commercial data bases called WordTracker and KeywordDiscovery. WordTracker contains three million search engine queries made during the last 2 months. For each page in Web4Health, you can select suitable search strings which are common according to WordTracker and which are appropriate for the page. These search strings are then inserted in the title, and once or twice more in the text, since Google mainly indexes web pages by their title. KeywordDiscovery has a much larger data base, but its entries are not as recent. We also use the log files of our own natural-language question-answering system, in order to know which are the most popular questions.

Example: One of our pages had the title "How children react to trauma". WordTracker showed that not a single query in its data base used this string. However, WordTracker showed that "effects of child abuse" and "psychological effects of child abuse" were popular search strings which were also appropriate for this page. Thus, we added this wording to the title of the page, and inserted it twice in the text of the page. This gave the page many more visitors from Google and other search engines than before the change.
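The selection step in this example can be sketched as below. The query counts are made-up placeholder numbers, not WordTracker data, and the function name pick_search_strings and the way the new title is composed are assumptions for illustration; in practice the choice of strings also depends on an editor judging whether a string is appropriate for the page.

```python
def pick_search_strings(candidates, query_counts, limit=2):
    """Return the candidates that occur in the query data, most frequent first."""
    found = [c for c in candidates if query_counts.get(c, 0) > 0]
    return sorted(found, key=lambda c: query_counts[c], reverse=True)[:limit]


# Placeholder counts, invented for this sketch (not real WordTracker figures).
query_counts = {
    "effects of child abuse": 40,
    "psychological effects of child abuse": 15,
}

old_title = "How children react to trauma"
candidates = [
    old_title.lower(),  # the existing title: absent from the query data
    "effects of child abuse",
    "psychological effects of child abuse",
]

chosen = pick_search_strings(candidates, query_counts)
# One possible way to add the most common string to the title.
new_title = f"{chosen[0].capitalize()} - {old_title}"
print(chosen)
print(new_title)
```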

Changing your web pages so that Google will show them more often belongs to an area known as Search Engine Optimization (SEO). Some SEO methods are unethical: they try to trick Google into showing a page too often, even when that page is not appropriate for the query. The SEO we have done, with the aim of showing our pages when these pages are appropriate for the query, is, however, not unethical. It is instead very important for increasing the value of the web site. Since twice as many people get to one of our pages from external search engines as from our own search tool, it is in fact twice as important to ensure that our pages are optimized for search engines as to ensure that our own search tool is optimized!

Figure 11 - Number of visitors/month to the Web4Health web site

diagram showing how the visitors per month have steadily increased from 40000 in 2004 to 900000 in 2008.

Figure 11 shows how the number of visitors to Web4Health has increased from the opening of the site in July 2003 until April 2008. The work we have done on SEO has probably contributed to this increase in the number of visitors.

One important reason why Web4Health has so many visitors is simply that the site has so many informational pages, each translated into multiple languages. One of the most important principles of Search Engine Optimization (SEO) is to optimize for many search strings. Multilinguality multiplies the number of pages, and thus also the number of search strings which will find each page.

8 Left to do

Here is a short list of functions which we have not yet completed, but which we think would improve our tool:

  1. A WYSIWYG (What You See Is What You Get) editor like Dreamweaver or Google Documents for editing texts and translations. Since the content developers use a web-based interface, this tool should be an applet, so that users need not install additional software on their computers. We have developed two such editors, one written in Javascript, the other written in Java, but have not yet got them working in our production environment.
  2. More complete support for the principle of separating structure and text is needed. In particular, links in pages to other pages should be easy to export to other languages, as soon as such a link has been created in one language. This includes involving the right translator for each target language.
  3. A spell checker, so that also badly spelled questions can be answered. Again, we have this developed, but not yet entered into our production environment.
  4. A tool to ease the migration of extensions of an already existing answer to other languages. This tool should show to the translator exactly what has been changed in the source language version of an answer, when translating this change to the translation. We already have the tool to show differences between versions, but have not yet implemented it in a neat way for migration of changes from language to language.
  5. We have found that for workflows with long chains of steps taken by different people, there is a risk that objects get stuck. Thus, it would be useful to have a tool which finds objects stuck in a step of an unfinished workflow and notifies the system administrator, and which reminds content developers when some task is expected of them. We have such a facility, but it needs improvement.
  6. The multilingual question-answering system would work better, if Google was given a special dictionary of terms used in question templates.
  7. Investigate use, wholly or partly, of algorithms for automatically generating new index pages with collections of hyperlinks that are currently published [13] and automatic selection of the optimal set of hyperlinks for a web site's portal page [14].

Just now, we do not have funding for doing these improvements, but we hope to get it in the future.

9 Conclusions

The main conclusion of this development is that it is important to design multilingual systems so that language-independent structure is clearly separated from language-dependent texts, so that changes in the language-independent structure can be done for all languages in one operation. This is achieved by putting the texts at the leaves of the data structures, and designing the system so that each such text-leaf can easily be specified in multiple languages and easily be translated. Also important is support for the work flow between different people doing different tasks, such as writing texts and translating them to different target languages.
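The idea of putting texts at the leaves can be sketched with a small data structure. This is a minimal illustration under assumed names (TextLeaf, MenuNode) and toy menu content, not the data model actually used by Web4Health. The structure (which node has which children) is stored once; only the leaves hold one string per language.

```python
from dataclasses import dataclass, field


@dataclass
class TextLeaf:
    """Language-dependent text: one string per language code."""
    translations: dict

    def render(self, lang, fallback="en"):
        """Return the text in `lang`, or in the fallback language if untranslated."""
        return self.translations.get(lang, self.translations.get(fallback, ""))

    def missing(self, langs):
        """Languages for which a translator still has work to do."""
        return [code for code in langs if code not in self.translations]


@dataclass
class MenuNode:
    """Language-independent structure: a label leaf plus child nodes."""
    label: TextLeaf
    children: list = field(default_factory=list)

    def render(self, lang, depth=0):
        lines = ["  " * depth + self.label.render(lang)]
        for child in self.children:
            lines.extend(child.render(lang, depth + 1))
        return lines


menu = MenuNode(
    TextLeaf({"en": "Health", "sv": "Hälsa"}),
    [MenuNode(TextLeaf({"en": "Sleep", "sv": "Sömn"}))],
)

# One structural change, made once, appears in every language version.
menu.children.append(MenuNode(TextLeaf({"en": "Stress"})))

print(menu.render("en"))
print(menu.render("sv"))
print(menu.children[1].label.missing(["en", "sv"]))
```

Because the new node exists in the shared structure, the Swedish version shows it at once (with the English text as a fallback), and the missing-translation list tells the work flow which translator to involve.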

10 References

[1]Cross-Language Evaluation Forum - CLEF, by Michael Kluck, http://www.gesis.org/ en/ research/ information_technology/ CLEF_DELOS.htm
[2]Professional Content Management Systems: Handling Digital Media Assets, by Andreas Mauthe and Peter Thomas, ISBN: 0-470-85542-8, Wiley March 2004.
[3]Web4Health Complete Final Project Report, by Jacob Palme, July 2004, http://web4health.info /documentation / D-7-4-full-final-rep.pdf
[4]Natural Language Question Answering System Classification Manual by Jacob Palme and Eriks Sneiders, http://web4health.info/ documentation/ D 2-2b-classification.pdf
[5]CLEF - Cross-Language Evaluation Forum, by Carol Peters, http://www.ercim.org/ publication/ Ercim_News/ enw40/ peters.html
[6]Sjödin, Elin: Preliminary Results (extracts from a forthcoming Master's thesis, Stockholm University).
[7]Access to the online translation engine, by Elsa Sklavounou, KOM2002 project report D 8.1 December 2002, http://web4health.info/ documentation/D-8-1-translator-access.pdf
[8]Automated FAQ Answering: Continued Experience with Shallow Language Understanding. Question Answering Systems by Eriks Sneiders. Papers from the 1999 AAAI Fall Symposium. Technical Report FS-99-02, November 5-7, North Falmouth, Massachusetts, USA, AAAI Press, pp.97-107 at http://www.dsv.su.se/ ~eriks/ Sneiders1999.pdf
[9]Short sample of: The CMS-Report Web Content Management Products & Practices, by Tony Byrne, CMS Watch http://www.cmswatch.com, autumn 2002.
[10]Content Management Systems: Getting from Concept to Reality, by C. Kartchner, The Journal of Electronic Publishing June 1998, Volume 3, Issue 4
[11]Content Management Bible, by Bob Boiko, John Wiley & Sons; 1st edition, December 2001.
[12]Ideh Alikhani & Bushra Al Hamdan: Utvärdering av hälsosajter (in Swedish, title translated to English: "Evaluation of health web sites"), Master's thesis at DSV June 2005.
[13]Perkowitz, M. and O. Etzioni (1998). Adaptive web sites: Automatically synthesizing web pages. In Proc. of the Innovative Applications of Artificial Intelligence Conf., pp. 727-732.
[14]Fang, X. and O. R. L. Sheng (2004). Linkselector: A web mining approach to hyperlink selection for web portals. ACM Transactions on Internet Technology 4, 209-237.
[15]Wallraff, B. 2000. "What global language?" The Atlantic monthly 286(5)
[16]Parker, R. 1995. Mixed signals: The prospects for global television news. New York, NY: Twentieth Century Fund Press.
[17]Pargman, Daniel and Palme, Jacob: ASCII Imperialism. In Standards and Their Stories: How Quantifying, Classifying, and Formalizing Practices Shape Everyday Life, by Susan Leigh Star and Martha Lampland, eds. Cornell University Press 2009.