In general, the ability to process input from different modality devices adds inherent robustness to a system, since the interpretation of the user's communicative acts can be based on input from different channels: errors in one channel can be compensated for by information coming from another channel. On the other hand, the use of additional modalities is also likely to introduce ambiguities and uncertainty which the system must be able to deal with. Moreover, multimodality puts heavy demands on flexible interaction modelling, since merging input from different sources requires extensive reasoning capabilities both in understanding and in responding to partial, erroneous, multi-channel input. The way in which different modalities reinforce or complement each other is still poorly understood and exploited. Modality integration is thus a compelling research topic. In fact, the representation and integration of multimodal messages in dialogue systems is also a very important issue for commercial parties. On the generation side, a dialogue system's combination of speech with other modalities to present information to the user also poses a number of unsolved problems regarding the choice, timing and consistency of the output.
To avoid building systems where the processing of multimodal inputs and generation of multimodal outputs is implemented as a series of idiosyncratic procedures tailored to specific tasks, standards and generic methodologies for modality integration should be studied. Such methodologies should enable systems to make the most of the redundancy introduced by multimodality and, so to speak, find (or present) the right information in the right thread. Furthermore, they should make multimodal systems more easily scalable and portable across domains.
The aim of this network is to contribute to the development of such standards, architectures and methodologies - and to a deeper understanding of how language and other modalities best complement each other in computer interfaces - by bringing together research institutes working with multimodal interaction in the Nordic countries. The relevance of such a network becomes clear when one considers the various relevant activities currently undertaken in the Nordic countries. These include research projects of national and European scale, courses (e.g. the 7th European Summer School on Language and Speech with the theme Multimodality in Language and Speech Systems, held in Stockholm in 1999 and arranged by KTH), and basic research carried out at individual research institutes. Furthermore, we believe funding a Nordic network on multimodal interaction is relevant to the Nordic language technology research programme - and in particular to the theme Human-computer interaction in natural language - not only because language is one of the modalities used, but also because techniques from NLP can be expected to play a major role in models of multimodal integration. In this respect, it is interesting to note that the growing interest in multimodal interaction is opening a new perspective on Nordic dialogue research, which is already acknowledged internationally.
The creation and running of a network on multimodality cannot be achieved by the individual efforts of the interested institutes. In order to produce fruitful and useful results, coordination of the work is needed, and joint activities must be planned, organised and seen through. Thus although willingness to participate and need for such a network already exist, an operational network requires that a basic infrastructure for coordination and management, and financial support of joint activities are in place. We believe that by providing this financial support, the Nordic language technology programme would contribute to the development of a very promising area in which Nordic research stands a good chance of achieving remarkable international results.
A central issue, and one where language technology research results may be capitalised on, is that of multimodal integration. A promising approach put forward by several researchers is in fact that of using techniques known from NLP (see Johnston et al. 1997). A distinction similar to that made in NLP between grammar rules and parsing algorithms can be made between a multimodal grammar and an algorithm for applying the grammar to input from multiple modalities. By upholding this separation of process and data, the merging of inputs from different modalities can be made more general: the entire representation becomes media-independent, and any procedures defined for modality integration within the processing stages are then applicable regardless of which input modalities the information in question originates from. Finally, defining algorithms for modality integration independently of the specific modalities used in a particular application also increases the chances that components of the system can be extended and/or reused. For example, in the Danish research project Staging, Center for Sprogteknologi (CST) has developed a multimodal dialogue interface to a virtual environment (see Paggio et al. 2000) where speech, keyboard and gestural inputs are merged by a feature-based parser. These results will be shared with the other network participants, and extended as a result of CST's engagement in the network. Another Danish partner, the SMC group from Aalborg, also has extensive research and teaching experience in the area of multimodality, complemented with expertise in speech processing.
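The idea of a media-independent representation merged by unification can be illustrated with a minimal sketch. The feature names below (an underspecified spoken command completed by a pointing gesture) are hypothetical illustrations, not the actual Staging or Johnston et al. representations:

```python
# A minimal sketch of feature-structure unification for modality merging.
# Feature structures are plain dicts; "action", "object", "id" etc. are
# invented for the example.

def unify(fs1, fs2):
    """Recursively unify two feature structures; return None on a clash."""
    result = dict(fs1)
    for key, value in fs2.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            sub = unify(result[key], value)
            if sub is None:
                return None          # incompatible substructures
            result[key] = sub
        elif result[key] != value:
            return None              # atomic value clash
    return result

# Spoken input "move that" leaves the referent underspecified ...
speech = {"action": "move", "object": {"type": "box"}}
# ... while a pointing gesture contributes the referent's identity.
gesture = {"object": {"type": "box", "id": "box-7"}}

merged = unify(speech, gesture)
# merged == {"action": "move", "object": {"type": "box", "id": "box-7"}}
```

Because `unify` never inspects where a structure came from, the same routine merges speech with gesture, keyboard with gaze, or any other pair of modalities - the media-independence argued for above.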
Another promising approach to modality integration is the use of machine learning techniques, especially techniques such as neural networks, which have already been successfully applied e.g. to speech recognition and various classification tasks. As has been the case in many other application domains, hybrid systems mixing rule-based approaches with machine learning algorithms may well provide the most interesting results for multimodal integration too. Although rule-based methods in general work reasonably well, it is a well-known problem that explicitly specifying the steps, i.e. the rules required to control the processing of the input, is a difficult task, and as the domain becomes more complex, the rules become more complex too. Often the correlation between input and output is difficult to specify. This is the case e.g. with multimodal interfaces, and thus approaches which are both robust and able to adapt to new inputs are needed. Expertise in this domain is brought to the network by the Media Lab at the University of Art and Design Helsinki (UIAH), and especially its Soft-Computing Interfaces Group, which is devoted to designing adaptive interfaces and developing tools for human-machine interaction, relying on nature-like emergent knowledge that arises from subsymbolic, unsupervised processes of a self-organizing nature (see e.g. Koskenniemi et al. 2001, Jokinen et al. 2001). One of the Media Lab's goals is also to explore the impact of new digital technology on society, and to evaluate, understand and deal with the challenges it poses to the design of information technology products. In this, multimodality plays an active role in opening new possibilities for communication, interaction, education and expression, and the network will provide an important channel for planning and integration of matters relating to interactive media.
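The hybrid architecture described above can be sketched very simply: hand-written rules handle the clear cases, and a learned classifier handles the rest. The rules and the classifier below are toy stand-ins invented for the example (a real system would use a trained model, e.g. a neural network):

```python
# Hypothetical hybrid interpreter: symbolic rules first, learned fallback.

def rule_based(utterance):
    """Return an interpretation when a rule fires, else None."""
    if utterance.strip().endswith("?"):
        return "question"
    if utterance.lower().startswith(("please", "could you")):
        return "request"
    return None            # no rule applies: defer to the learned model

def learned_classifier(utterance):
    # Stand-in for a trained model; a fixed default keeps the sketch
    # self-contained.
    return "statement"

def interpret(utterance):
    # Rules take precedence; the classifier covers everything they miss.
    return rule_based(utterance) or learned_classifier(utterance)

print(interpret("Where is the box?"))   # question (rule)
print(interpret("Please move it."))     # request (rule)
print(interpret("The box is red."))     # statement (learned fallback)
```

The design point is the division of labour: the rules stay small and auditable, while the adaptive component absorbs the input-output correlations that are hard to specify by hand.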
To fully exploit multimodality in various interfaces, it is important to know how neurocognitive mechanisms support multimodal and multisensory integration. Compared to the research devoted to single sensory systems, there has been very little research on the mechanisms by which information received via different senses is integrated. However, the Cognitive Science and Technology research group at the Helsinki University of Technology is using various methods to uncover the integration principles of auditory and visual speech. On the basis of the results, mathematical models of the integration are being developed. The group is also developing a Finnish artificial person, a talking and gesturing audiovisual head model. The model will be used in practical dialogue systems, and will also serve as a well-controlled stimulus for neurocognitive studies.
In the same way as rule-based integration of modalities can be enhanced using machine learning techniques, results obtained through purely probabilistic analysis methods may well be boosted by the addition of symbolic rules. An example relevant to multimodal interfaces is provided by the algorithms for character and word prediction used in connection with eye-tracking, where the system tries to guess what the user is "typing with the eye". Although the performance of the probabilistic approaches implemented in current systems is promising, language technology techniques seem to constitute a valuable add-on. This is an issue that the group at the IT University of Copenhagen is working on.
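A minimal sketch of the combination, assuming an invented toy corpus and lexicon: a frequency model proposes completions of the prefix typed so far, and a symbolic filter (here a tiny lexicon, standing in for a morphological analyser) keeps only well-formed words.

```python
# Illustrative word completion for "eye typing": probabilistic ranking
# constrained by a symbolic lexicon. All data is made up for the example.

from collections import Counter

corpus = "the system tries to guess what the user is typing with the eye".split()
unigram = Counter(corpus)          # word frequencies from the toy corpus
lexicon = set(corpus)              # symbolic stand-in: only attested words allowed

def complete(prefix, k=3):
    """Rank in-lexicon completions of a prefix by corpus frequency."""
    candidates = [w for w in lexicon if w.startswith(prefix)]
    return sorted(candidates, key=lambda w: -unigram[w])[:k]

print(complete("t"))   # 'the' ranks first, being the most frequent
```

In a real system the unigram counts would be replaced by a proper language model, and the lexicon filter is where the language technology add-on argued for above comes in.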
A third research issue regards the interpretation of multimodal input and the generation of multimodal output in relation to a dialogue model and to a model of the domain and task at hand. Several language technology institutes in the Nordic countries have contributed substantially to dialogue research, and have developed dialogue models as well as implemented dialogue systems. Notable examples are the Department of Linguistics at the University of Göteborg, the Natural Interactive Systems Laboratory (NISLab) at the University of Southern Denmark, and the natural language processing research group (NLPLab) at the University of Linköping, all of which will participate in the network. The Göteborg group has extensive experience in corpus collection and dialogue management. They have developed tools for spoken language analysis and coding which can be applied to the collection and analysis of multimodal dialogues, thus providing an empirical basis and insight for research on multimodal interaction: how different modalities are used in human-human communication (Allwood, 2001). NISLab has a strong background in dialogue management, component and system evaluation, and spoken dialogue corpus coding, having led the EU projects DISC and DISC2 (1997-2000) on best practice in the development and evaluation of spoken language dialogue systems and components (see www.disc2.dk), as well as the EU project MATE (1998-2000), which developed the MATE Workbench for multi-level and cross-level annotation of spoken dialogue.
NISLab is currently in the process of generalising the DISC and MATE results by addressing best practice in the development and evaluation of natural interactivity systems and components (in the EU project CLASS, 2000-2002), surveying data resources, coding schemes and coding tools for natural interactivity (EU-US project ISLE, 2000-2002), and building the world's first general-purpose coding tool for natural interactive communicative behaviour (EU project NITE, 2001-2002). NLPLab at Linköping University has for almost two decades conducted research on dialogue systems and now has a platform for the development of multimodal dialogue systems for various applications, to be developed further towards an open source code repository (Degerstedt & Jönsson, 2001). The current focus is on integrating dialogue systems with intelligent document processing techniques in order to develop multimodal dialogue systems that can retrieve information from unstructured documents, where the request requires that the user, in a dialogue with the system, specifies their information needs (Merkel & Jönsson, 2001). At KTH several multimodal dialogue systems have been developed. The first system, Waxholm, was a multimodal system exploring an animated agent (Carlson & Granström, 1996). Current work and interest involves research on multimodal output using animations and, to some extent, multimodal input using both speech and pointing (Gustafson et al., 2000).
Another branch of research includes the development of generic technology resources in an open source code repository. This involves a method for the development of dialogue systems (Degerstedt and Jönsson, 2001), as well as the design of generic system architectures. The Jaspis framework developed at the TAUCHI group at the University of Tampere (Turunen and Hakulinen, 2000) provides an agent-based flexible development platform which has been applied to various dialogue applications.
More concretely, the network will organise a number of activities, described below.
The network will coordinate with broader, multidisciplinary oriented networks like Nordic Interactive and its NIRES research school, which concentrate on interactive digital media. MUMIN will complement these networks by focussing on the multimodal and language technology aspects of the interaction. It will also seek contact with the ACL/ISCA Special Interest Group on Discourse and Dialogue (SIGdial), whose current president is Professor Laila Dybkjær from NISLab.
We expect 28 PhD students to participate in the network, distributed among the participating countries as follows: 10 in Denmark, 9 in Finland and 9 in Sweden. However, we expect the network participants to attract a larger number of students than those formally "registered". The network will also support PhD students' visits to other Nordic countries.
Throughout the whole period: email discussions and web site maintenance. The network will also be present at NoDaLiDa.
Bernsen, N. O. (2001) Multimodality in language and speech systems - from theory to design support tool. Chapter to appear in Granström, B. (Ed.): Multimodality in Language and Speech Systems. Dordrecht: Kluwer Academic Publishers.
Bernsen, N. O., Dybkjær, L. (2001) Combining multi-party speech and text exchanges over the Internet. Proceedings of Eurospeech 2001, pp. 1189-1192.
Carlson, R. and Granström, B. (1996). The WAXHOLM spoken dialogue system. Acta Universitatis Carolinae Philologica 1, pp. 39-52.
Degerstedt, L. and Jönsson, A. (2001). A Method for Iterative Implementation of Dialogue Management. IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle.
Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D. and Wirén, M. (2000). AdApt - a multimodal conversational dialogue system in an apartment domain. In Proceedings of ICSLP 2000, Beijing, vol. 2, pp. 134-137.
Johnston, M., Cohen, P. R., McGee, D., Oviatt, S. L. and Pittman, J. A. (1997). Unification-based multimodal integration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain, pp. 281-288.
Jokinen, K., Hurtig, T., Hynnä, K., Kanto, K., Kaipainen, M. and Kerminen, A. (2001). Dialogue Act classification and self-organising maps. In Proceedings of the Neural Networks and Natural Language Processing Workshop, Tokyo, Japan.
Koskenniemi T., Kerminen A., Raike, A. and Kaipainen, M., (2001) Presenting data as similarity clusters instead of lists. In Proceedings of the 1st International Conference on Universal Access in Human-Computer Interaction, New Orleans, USA.
Merkel, M. and Jönsson, A. (2001). Towards multimodal public information systems. Proceedings of the 13th Nordic Conference on Computational Linguistics, NoDaLiDa '01, Uppsala, Sweden.
Nivre, J., Tullgren, K., Allwood, J., Ahlsén, E., Holm, J., Grönqvist, L., Lopez-Kästen, D., and Sofkova, S. (1998). Towards multimodal spoken language corpora: TransTool and SyncTool. Proceedings of ACL-COLING 1998.
Paggio, P., Jongejan, B. and Madsen, C. B. (2000). Unification-based multimodal analysis in a 3D virtual world: the Staging project. In Proceedings of the CELE-Twente Workshop on Language Technology: Interacting Agents, pp. 71-82.
Sams, M., Manninen, P., Surakka, V., Helin, P. and Kättö, R. (1998). Effects of word meaning and sentence context on the integration of audiovisual speech. Speech Communication, 26, 75-87.
Sams, M., Kulju, J., Möttönen, R., Jussila, V., Olivés J-L., Zhang, Y., Kaski, K., Majaranta, P., Räihä, K-J. (2000). Towards a high-quality and well-controlled Finnish audio-visual speech synthesizer. Proceedings of The 4th World Multiconference on Systemics, Cybernetics and Informatics (Sci'2000), Orlando, Florida (USA).
Turunen, M. and Hakulinen, J. (2000). JASPIS - a Framework for Multilingual Adaptive Speech Applications. In Proceedings of the 6th International Conference on Spoken Language Processing, ICSLP 2000.