Multimodal Interactions


Communication is often achieved through coordinated visual and linguistic representations. Documents include images and diagrams, and face-to-face conversations are accompanied by iconic gestures. There seems to be some kind of magic in multimodal communication that allows interlocutors to achieve a shared understanding. This is partly due to the systematic and conventional rules that govern the interpretation and generation of visual and spatial representations, and partly due to the psychological meanings that they carry. What are these systematic rules? To what extent can we use well-developed natural language techniques to understand the organization of multimodal presentations? What is the cognitive basis for understanding and designing diagrams?

Although humans interpret such multimodal forms of communication effortlessly, this is a very difficult task for computers. The difficulty is due in part to our limited scientific knowledge of the structure and organization of these presentations. I work towards designing conversational systems that can synthesize multimodal presentations to convey information to human users and systems that can deal appropriately with the multimodal communication produced by people.  Here are some of my publications related to this topic. 

  1. CITE: A Corpus of Image-Text Discourse Relations, M. Alikhani, S. Nag Chowdhury, G. de Melo, M. Stone, In Proceedings of NAACL19.

  2. "Caption" as a Coherence Relation: Evidence and Implications, M. Alikhani, M. Stone, In Proceedings of NAACL19, Workshop on Shortcomings in Vision and Language.

  3. AI2D-RST: A multimodal corpus of 1000 primary school science diagrams, T. Hiippala, M. Alikhani, J. Haverinen, T. Kalliokoski, E. Logacheva, S. Orekhova, A. Tuomainen, M. Stone, J. Bateman.

  4. A Coherence Approach to Data-Driven Inference in Visual Communication, M. Alikhani, T. Hiippala, M. Stone, CVPR2019 Workshop on Language and Vision.

  5. Multimodal Strategies for Ambiguity Resolution in Conversational Systems, M. Alikhani, E. Selfridge, M. Stone, M. Johnston, In submission.

  6. Arrows are the Verbs of Diagrams, M. Alikhani, M. Stone, In Proceedings of COLING2018, the 27th International Conference on Computational Linguistics.

  7. Exploring Coherence in Visual Explanations, M. Alikhani, M. Stone, In Proceedings of First International Workshop on Multimedia Pragmatics.

Joint Actions and Common Ground


Collaborative robotics requires effective communication between a robot and a human partner. This work proposes a set of interpretive principles for how a robotic arm can use pointing actions to communicate task information to people by extending existing models from the related literature. These principles are evaluated through studies where English-speaking human subjects view animations of simulated robots instructing pick-and-place tasks. The evaluation distinguishes two classes of pointing actions that arise in pick-and-place tasks: referential pointing (identifying objects) and spatial pointing (identifying locations). The study indicates that human subjects show greater flexibility in interpreting the intent of referential pointing than of spatial pointing, which needs to be more deliberate. The results also demonstrate the effects of variation in the environment and task context on the interpretation of pointing. The corpus and the experiments described in this work can inform models of context and coordination, as well as our understanding of the role of common sense reasoning in human-robot interactions.

  1. That and There: Judging the Intent of Pointing Actions with Robotic Arms, M. Alikhani, B. Khalid, R. Shome, C. Mitash, K. Bekris, M. Stone, In Proceedings of AAAI-2020.

Generating Referring Expressions


Natural language generation is concerned with producing linguistic material from non-linguistic material. Referring expressions are the ways we use language to refer to entities around us. How do people produce such expressions? What drives how choices are made and understood in producing referring expressions? How can we efficiently compute the properties to include in a description, so that it successfully identifies the target without triggering false conversational implicatures? Given an intended referent, a knowledge base of entities characterized by properties expressed as attribute-value pairs, and a context consisting of other salient entities, basic algorithms generate a distinguishing referring expression by choosing a set of attribute-value pairs that uniquely identifies the intended referent. These are my publications related to this topic.

  1. Designing Grounded Representations for Semantic Coordination, B. McMahan, M. Alikhani, M. Stone, In preparation.
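
To make the basic setup described in this section concrete, here is a minimal sketch of a greedy procedure in the spirit of Dale and Reiter's Incremental Algorithm. It is an illustration under stated assumptions, not the method of the paper listed above: the function name, the toy knowledge base, and the attribute preference order are all hypothetical. An attribute is added only if it rules out at least one remaining distractor, which is one simple way to avoid including properties that serve no distinguishing purpose.

```python
from typing import Dict, List, Optional, Tuple

# An entity is a set of attribute-value pairs, e.g. {"type": "dog", "color": "black"}.
Entity = Dict[str, str]


def distinguishing_description(
    referent: str,
    knowledge_base: Dict[str, Entity],
    context: List[str],
    preferred_attributes: List[str],
) -> Optional[List[Tuple[str, str]]]:
    """Greedily pick attribute-value pairs that single out `referent`.

    Returns the chosen pairs, or None if the referent cannot be
    distinguished from every other salient entity in `context`.
    The preference order is an assumption supplied by the caller.
    """
    target = knowledge_base[referent]
    # Distractors: the other salient entities the description must rule out.
    distractors = {name for name in context if name != referent}
    description: List[Tuple[str, str]] = []

    for attribute in preferred_attributes:
        if not distractors:
            break  # the description is already distinguishing
        value = target.get(attribute)
        if value is None:
            continue  # the referent has no value for this attribute
        ruled_out = {
            name for name in distractors
            if knowledge_base[name].get(attribute) != value
        }
        # Keep the attribute only if it excludes at least one distractor.
        if ruled_out:
            description.append((attribute, value))
            distractors -= ruled_out

    return description if not distractors else None


# Hypothetical toy knowledge base, for illustration only.
TOY_KB: Dict[str, Entity] = {
    "d1": {"type": "dog", "color": "black", "size": "small"},
    "d2": {"type": "dog", "color": "white", "size": "small"},
    "c1": {"type": "cat", "color": "black", "size": "small"},
}

if __name__ == "__main__":
    print(distinguishing_description(
        "d1", TOY_KB, ["d1", "d2", "c1"], ["type", "color", "size"]
    ))
```

In this toy example, "size" is never selected for d1 because "type" and "color" already exclude both distractors. A greedy procedure like this does not guarantee the shortest possible description; it trades minimality for efficiency.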

Cognitive Linguistics



Effective communication depends on using language to refer to objects and entities around us. Words can flexibly refer to different ranges of continuous values in different contexts. This variability is most apparent with relative gradable adjectives such as "long" and "short". The use of these words seems to vary across people, objects, and contextual expectations. For instance, one may describe a person as tall in one context but not describe the same person as a tall basketball player. How does forced choice among a set of alternatives affect the use of vague terms? Does the absence of a vague term affect the category boundaries of neighboring terms? How do expectations for vague terms allow for effective communication? The following papers discuss these questions. To gain insight into people's expectations for vague words, we have looked at two vague categories: probability and color. The results show that the flexibility of vague terms depends on how well defined their categories are. For example, basic color terms are argued to have well-defined, non-overlapping categories, whereas probability terms flexibly refer to different values as a function of the available alternative choices.

  1. Vague Categories in Communication, M. Alikhani, K. Persaud, B. McMahan, K. Pei, P. Hemmer, M. Stone, In preparation.

  2. The Influence of Alternative Terms on Speakers’ Choice of Vague Description, M. Alikhani, K. Persaud, B. McMahan, K. Pei, P. Hemmer, M. Stone, In submission.

  3. When is Likely Unlikely: Investigating Variability of Vagueness, K. Persaud, B. McMahan, M. Alikhani, K. Pei, P. Hemmer, M. Stone, In Proceedings of the Cognitive Science Society Conference. 


Computational Social Science and Digital Humanities

Poetry explores the space of imagination beyond linguistic interpretation and pragmatics yet brings distinctive insights.  To communicate the intended meaning, a poet may recruit a broader interpretation and general knowledge of the world.  I am interested in developing natural language processing tools and techniques for studying poems and literary texts. Here are my related publications:

  1. Tracking context changes in Persian poetry, M. Alikhani*, S. Raji*, G. de Melo, and M. Stone, In submission. *equal contribution