Interactive Dance Music II

July 5, 2021
Software, Sound
max, HCI, machine learning

Aims & Overview #

This post serves as background and technical explanation for the Interactive Dance Music II system, submitted for a final-year 40-credit Music Technology module. It describes the aesthetic and technical ideals researched to build an interactive music system capable of analysing and reproducing sound-motion relationships, along with the historical precedent for this practice. It also delves into the motion capture techniques used to track movement, and the machine learning techniques which correlate gestures and physical patterns with the sound features extracted from audio played simultaneously with the recorded movement, taking into account their temporal and acoustic properties and reproducing the learnt patterns via granular synthesis.

Aesthetically, this project considers the action-perception feedback loop, as defined in embodied music cognition, to be a fundamental design principle for deciding movement-sound mappings. The relationship between action and sound is therefore derived from the actions of the dancer in demonstration: descriptions of actions are stored after being correlated with sound features, and these stored relationships are subsequently used to generate sounds in response to a dancer. Demonstrations of the system will be available in subsequent posts.

A little aside – a eureka moment that struck me while iteratively improving this project and looking through others’ explorations of expressive musical control was that an instrument is a device for translating body movements into sound. This may seem obvious, but it took a while for me to realise that even when tapping on a table, or using cutlery as percussion, the conscious exercise of my kinaesthetic self was what produced acoustic music. As such, the IDMII system is built to further explore how we perceive our relationship with sound: it forces us to examine which movements we associate with particular timbres, notes, and rhythms, and, through this examination, to adapt. I will finish this aside with a quote from Michael Hawley, a digital-music pioneer:

The change, of course, is that instruments are no longer relatively simple mechanical systems. Instruments now have memory and the ability to receive digital information. They may render music in a deeply informed way, reflecting the stored impressions of many instruments, halls, performers, compositions, and a number of environmental or contrived variables. At first, digital technology let us clumsily analyse and synthesise sound; later, it became possible to build digital control into instrument interfaces, so that musician’s gestures were captured and could be operated upon; and now, we can see that it will eventually be possible to compute seamlessly between the wave and the underlying musical content. That is a distinct departure from the traditional idea of an instrument.1

What is interesting about the trend is that it injects intelligence and autonomy into instruments. Their function is no longer “only to transmit” or transduce: they become sources and interpreters of deeper musical information in their own right.

Background on Digital Instrument Design and Embodied Cognition #

Current digital technologies are playing an increasingly important role in the development of contemporary aesthetics, whether in music or dance. Digital Musical Instruments have a long history, starting, amongst others, with the glove-based sensor project The Hands by Michel Waisvisz.2 This work inspired a plethora of performers to achieve interactivity with technology in their artistic work, combining sensor mappings with visual and sound parameters.3 Modern advances in computer vision, miniaturised sensors, and data processing have allowed artists to focus on Human-Computer Interaction (HCI) in capturing movement data and using it to manipulate datasets, particularly sound. This focus on movement started with the Theremin in 1920,4 and has been a major driver for computer vision technology, likely due to its questioning of the traditionally reactive sound-motion relationship.5

Music and dance are often seen as separate art forms, yet in their essence they are tied forms of human expression. Dances such as the waltz, with a ¾ time signature corresponding to the number of steps performed, clearly demonstrate a translation of musical parameters onto physical ones, taking into account the musical dimensions of rhythm, meter, and form.6 This relates to the embodied view of music cognition, which aims to refute the computational perspective of musical perception by stating that our understanding is as much sensorial and kinaesthetic as it is related to our formal understanding of music theory.7

Design and Functionality #

My first attempt at constructing a Max-based mapping device aimed to incorporate embodied interaction into the compositional process, but failed to account for the agency and preferences of the dancer within the system. Instead, it directly related the movement of certain body parts along cartesian axes to specific sound parameters, such as the Cutoff Frequency of Ableton Live’s Auto Filter device, chosen from a simple drop-down menu. This was one of several instruments designed to test this direct-mapping concept, but it proved contrived due to the program’s limited ability to define and recognise movement, as well as the difficulty of making the dancer understand the functionality of relatively complex effect chains.
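To make the direct-mapping idea concrete, here is a trivial Python sketch of the kind of one-to-one scaling this first version performed. The normalised input range and the MIDI CC target are illustrative assumptions, not the original Max patch.

```python
# Hypothetical sketch of "direct mapping": one joint coordinate linearly
# scaled onto one effect parameter (e.g. a filter cutoff sent as a MIDI CC).
def y_to_cutoff_cc(y_norm: float) -> int:
    """Map a normalised vertical position (0..1) to a MIDI CC value (0..127)."""
    y_norm = min(max(y_norm, 0.0), 1.0)   # clamp out-of-range tracking values
    return round(y_norm * 127)

print(y_to_cutoff_cc(0.42))   # hand at 42% of frame height -> CC value 53
```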

versionOne

Recent advances in HCI technology have promoted the concept of Interactive Machine Learning (IML), which focuses on improving algorithmic inferences through user intervention, empowering users with clear methods for interactive design – Mapping by Demonstration.8 Fiebrink’s Wekinator, released in 2010, was a seminal tool for this field: its approach considers user decisions at several steps in the process, allowing the editing of training examples as well as parameter tuning and evaluation through live interaction.9 Fiebrink’s paper on play-along mapping describes the definition of training examples via user demonstration: listening to the sounds and moving along to them. The work within the IDMII project was inspired by these concepts, and could not have been produced without Fiebrink’s theoretical framework or the extensive implementation of multi-modal models developed by Françoise in the MuBu library. As such, the goal of an embodied approach to the design of interactive music software is to acknowledge that users must be provided with the tools to define their own mappings, their own embodied musical relationships, based on their perceptions of chosen sounds.

Mapping by Demonstration #

This instrument presents an application of Mapping by Demonstration (MbD) in an attempt to unify gesture and sound through an embodied approach to interactive musical instrument design. This is achieved using probabilistic models that consistently encode the relationships between sounds and the movements defined by the user, creating a multi-dimensional encounter that places the user within the action-perception loop. The algorithmic relationship generated between movement and sound is henceforth referred to as the mapping, as formalised by Rovan.10 A central tenet in the design process, with respect to embodied cognition, was the notion of a multi-dimensional encounter in which the sounds and movements of the user evolve with each other in a feedback loop driven by the user’s voluntary actions. As such, I propose a method of harnessing machine learning algorithms to give the user the ability to provide examples, train the algorithm, evaluate outcomes, and modify sound parameters in real time within a single software environment.

Motion Sensing #

The first step in creating an ‘example’ to train the model is motion capture, in this case using the Kinect V2 motion sensor from Microsoft. NI Mate is the driver software used for interfacing with the device; OpenNI was discontinued in 2014 following Apple’s acquisition of PrimeSense, so NI Mate provides one of the few Mac-compatible motion capture input suites. Communication between NI Mate and Max is performed over the OSC protocol on port 7000. The CNMAT externals are used to route the OSC messages according to their structure: each message comes in the form – /joint_Name_1 [X] [Y] [Z] – and is thus sorted according to joint name and side, before having the axes separated via the unpack object.
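For illustration, here is a minimal Python sketch of receiving and unpacking such joint messages outside Max, using the python-osc library; the port follows the description above, and the exact joint addresses NI Mate sends are not assumed here.

```python
# Minimal sketch (not the Max patch): listen for OSC messages on port 7000
# and unpack the three axis values from each joint address.
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

def handle_joint(address, *args):
    # Each message arrives as "/joint_Name_1 x y z" (joint name and side, then three axes).
    if len(args) == 3:
        x, y, z = args
        print(f"{address.lstrip('/')}: x={x:.3f} y={y:.3f} z={z:.3f}")

dispatcher = Dispatcher()
dispatcher.set_default_handler(handle_joint)   # catch every joint address

server = BlockingOSCUDPServer(("0.0.0.0", 7000), dispatcher)
server.serve_forever()
```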

Recording Multi-Modal Data #

An essential object from the MuBu library is the iMuBu multi-track container, which represents multiple temporally aligned, homogeneous data streams compatible with the SDIF standard (allowing the import and export of synchronous sound descriptors, motion capture data, and sound).11 As the MuBu library objects are compiled from a C++ library, it is not possible to open them as sub-patchers to view the sequence of their internal processes; however, the clear structure of their attribute and argument declarations makes their contents and purpose easier to understand. The declarations are not chronological, but rather apply to the iMuBu in which the audio, audio descriptor, and motion datasets are stored for processing. The mubu.track kinectLearn 1 audio object defines the audio capture section of the buffer, initialising it for recording with a specified size (@maxsize of 24 seconds) and visualisation method (audio waveform). Instances of this object also exist to define the audio description and movement buffers. Twelve audio description parameters are used, as specified by the @matrixcols 12 attribute. The MuBu for Max Reference documents the attributes referred to by these commands.
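To make the idea of temporally aligned tracks concrete, here is a loose Python sketch of a multi-track buffer. It only mimics the concept and is not the MuBu/SDIF implementation; the names and rates in it (44.1 kHz audio, 100 descriptor frames per second, 25 Kinect joints at 30 fps) are illustrative assumptions.

```python
# Conceptual sketch: several data streams kept temporally aligned by a shared
# time base, each with its own frame rate and number of columns.
from dataclasses import dataclass, field
from typing import Dict
import numpy as np

@dataclass
class Track:
    data: np.ndarray      # (n_frames, n_cols), e.g. a descriptor track with 12 columns
    frame_rate: float     # frames per second (for the audio track: the sample rate)

    def at_time(self, t: float) -> np.ndarray:
        """Return the frame closest to time t (seconds)."""
        i = int(round(t * self.frame_rate))
        return self.data[min(i, len(self.data) - 1)]

@dataclass
class MultiTrackBuffer:
    tracks: Dict[str, Track] = field(default_factory=dict)

    def aligned_frames(self, t: float) -> Dict[str, np.ndarray]:
        """One temporally aligned frame from every track at time t."""
        return {name: track.at_time(t) for name, track in self.tracks.items()}

# 24 s of audio at 44.1 kHz, descriptors at 100 frames/s, Kinect joints at ~30 fps
buf = MultiTrackBuffer({
    "audio":  Track(np.zeros((24 * 44100, 1)), 44100.0),
    "descr":  Track(np.zeros((2400, 12)), 100.0),   # 12 descriptor columns
    "motion": Track(np.zeros((720, 75)), 30.0),     # 25 joints x 3 axes
})
print(buf.aligned_frames(1.5)["descr"].shape)   # (12,)
```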

Processing and Synthesis #

The mubu.xmm object from the MuBu for Max library, used to correlate movement and sound, requires the recorded examples to be organised as temporally aligned sequences of motion parameters and audio descriptors; the storage mechanism – the iMuBu Buffer – is shown below after movement and sound have been recorded.12

iMuBu Buffer

The mubu.xmm object is not intended to map directly between motion parameters and audio; rather, it maps between sequences of motion parameters and sound descriptors. As such, after audio is imported it must be processed via the pipo~ mfcc object (launched automatically upon recording). pipo~ is used to extract MFCCs from the audio stream: audio descriptors which characterise the shape of the sound’s spectrum via a set of coefficients, known as mel-frequency cepstral coefficients, whose calculation requires STFT frames, a mel filterbank, and a discrete cosine transform.13
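For comparison, here is a short sketch of an equivalent offline MFCC analysis in Python using librosa rather than pipo~ (frame sizes, windowing, and scaling will therefore differ from the IRCAM modules); the file name and analysis parameters are illustrative, with 12 coefficients chosen to match the descriptor track above.

```python
# Illustrative MFCC extraction with librosa, not the pipo~ implementation.
import librosa

y, sr = librosa.load("recording.wav", sr=None)            # hypothetical audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=1024, hop_length=256)   # STFT -> mel bands -> DCT
print(mfcc.shape)   # (12, n_frames): one 12-dimensional descriptor per analysis frame
```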

After sound and motion are recorded, the mubu.xmm object is used to map between the movement and the MFCC sound descriptors. When train mode is engaged, the motion parameters are sent to mubu.xmm, which learns the relationship between the gesture and the description of the sound. At runtime, motion parameters are sent to the input of mubu.xmm, which generates MFCCs based on the associated input movement (and its variations from the original input). These MFCC descriptors can then be re-synthesised using mubu.granular~ in conjunction with mubu.knn. In this context, the KNN algorithm builds a multi-dimensional search tree over the recorded descriptor data, so that each generated MFCC frame can be efficiently matched with its closest recorded counterpart, whose position in the buffer then drives the granular playback.14
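As a simplified stand-in for the mubu.knn and mubu.granular~ stage, the sketch below builds a kd-tree over recorded MFCC frames and returns, for a generated descriptor frame, the time position of its closest match; the file name and hop size are hypothetical.

```python
# Nearest-neighbour lookup over recorded MFCC frames: the matched frame's time
# position would be used as the grain position for granular playback.
import numpy as np
from sklearn.neighbors import NearestNeighbors

hop_s = 256 / 44100.0                          # hypothetical descriptor hop (seconds)
recorded_mfcc = np.load("recorded_mfcc.npy")   # (n_frames, 12), hypothetical dump of the buffer

tree = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(recorded_mfcc)

def grain_position(generated_frame: np.ndarray) -> float:
    """Time (s) in the recorded sound whose descriptor best matches the model output."""
    _, idx = tree.kneighbors(generated_frame.reshape(1, -1))
    return float(idx[0, 0]) * hop_s
```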

Multimodal Hidden Markov Model for Temporal Correlation #

The XMM library differentiates between movement and multi-modal models, as well as between instantaneous and temporal models. As opposed to movement models, multi-modal models are trained on joint sequences of motion and sound features, which enables them to predict the relationship between the two; these probabilistic models therefore allow sound features to be generated from motion input and then resynthesised. Temporal models, unlike instantaneous models whose predictions are independent of previous input, take the time series of previous input into account.15 To provide a consistent reconstruction of the input sounds from user movement, a temporal, multi-modal model was needed. The only model in the library matching both criteria is the Multimodal Hidden Markov Model, which simultaneously accounts for the temporal evolution of the sound and its relationship to the movement in the given examples.
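As a rough illustration of what a temporal, multi-modal model does (and only that; this is not the XMM implementation, which uses hierarchical HMMs and richer per-state distributions), the sketch below uses placeholder parameters: a forward recursion infers the hidden-state posterior from the incoming motion frames, and the output sound descriptors are the posterior-weighted sound means of the states.

```python
# Simplified multimodal-HMM regression: motion frames in, sound descriptors out.
import numpy as np

rng = np.random.default_rng(0)
K, M, S = 8, 3, 12   # hidden states, motion dimensions, sound-descriptor dimensions

# Placeholder parameters; in the real system these are learned from the demonstration.
startprob = np.full(K, 1.0 / K)
transmat = np.full((K, K), 0.1 / (K - 1))
np.fill_diagonal(transmat, 0.9)                    # sticky self-transitions
motion_means = rng.normal(size=(K, M))
motion_vars = np.ones((K, M))
sound_means = rng.normal(size=(K, S))              # e.g. 12 MFCC values per state

def motion_likelihood(x):
    """Diagonal-Gaussian likelihood of one motion frame under each hidden state."""
    diff = x - motion_means
    logp = -0.5 * np.sum(diff ** 2 / motion_vars + np.log(2 * np.pi * motion_vars), axis=1)
    return np.exp(logp - logp.max())               # rescaled for numerical stability

def forward_step(alpha_prev, x):
    """One forward-algorithm step: propagate the state posterior, weight by likelihood."""
    predicted = startprob if alpha_prev is None else alpha_prev @ transmat
    alpha = predicted * motion_likelihood(x)
    return alpha / alpha.sum()

def regress_sound(motion_stream):
    """Map a stream of motion frames to a stream of sound-descriptor frames."""
    alpha, out = None, []
    for x in motion_stream:
        alpha = forward_step(alpha, x)
        out.append(alpha @ sound_means)            # posterior-weighted sound means
    return np.array(out)

fake_motion = rng.normal(size=(100, M))            # stand-in for live Kinect frames
print(regress_sound(fake_motion).shape)            # (100, 12)
```

Because the posterior at each step depends on the previous one through the transition matrix, the prediction depends on the history of the movement, which is exactly what distinguishes a temporal model from an instantaneous one.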

Conclusion #

To merge gesture and sound seems an intuitive human faculty, yet one which resists duplication due to the complexity of embodied cognition and its depths yet to be uncovered. This project, through background research in Human-Computer Interaction, experimentation, and the use of existing libraries, has resulted in a body sampler which allows spatial and temporal exploration of pre-recorded sound. Rooted in the notion of embodiment, it allows users to question their senses and explore what it means to dance, and to move to music.


  1. Michael Hawley, “Structure out of Sound”, (MIT Press, 1983), pp. 55. ↩︎

  2. Atau Tanaka and Marco Donnarumma, “The Body As Musical Instrument”, in The Oxford Handbook Of Music And The Body (Oxford University Press, 2018), pp. 1-20. ↩︎

  3. Giuseppe Torre, Kristina Andersen, and Frank Baldé, “The Hands: The Making of a Digital Musical Instrument”, Computer Music Journal, 2016. ↩︎

  4. Mathew, Kasey, “Strange Vibrations: The Evolution of the Theremin” (2019). Capstone Projects and Master’s Theses. 463. ↩︎

  5. Hugo Scurto, “Designing With Machine Learning For Interactive Music Dispositifs” (Sound [cs.SD] Ph.D, Sorbonne Université, 2020). ↩︎

  6. Wayne Siegel, “Dancing The Music: Interactive Dance And Music”, in The Oxford Handbook Of Computer Music (New York: Oxford University Press, 2020), pp. 191-213. ↩︎

  7. Marc Leman, “An Embodied Approach to Music Semantics”, Musicae Scientiae, 2010, pp. 43-67. ↩︎

  8. Jules Francoise, “Motion-Sound Mapping By Demonstration” (PhD, Universite de Pierre et Marie Curie, 2015). ↩︎

  9. Rebecca Fiebrink and Perry R. Cook, “The Wekinator: A System For Real-Time, Interactive Machine Learning In Music”, Princeton University Press, 2010. ↩︎

  10. Joseph Butch Rovan, Marcelo Wanderley and Shlomo Dubnov, “Instrumental Gestural Mapping Strategies As Expressivity Determinants In Computer Music Performance”, Analysis-Synthesis Team/Real-Time Systems Group, IRCAM, 1997. ↩︎

  11. Norbert Schnell, Axel Röbel and Diemo Schwarz, “MuBu and Friends – Assembling Tools for Content Based Real-Time Interactive Audio Processing in Max/MSP”, IRCAM, 2021, 1-4. Website ↩︎

  12. Jules Francoise, “All Time Posts”, 129.102.1.155, 2021. Website ↩︎

  13. Norbert Schnell, Diemo Schwarz and Joseph Larralde, “PiPo, A Plugin Interface For Afferent Data Stream Processing Modules”, HAL (Archives Ouvertes), 2017, 1-5. ↩︎

  14. Mubu.Knn (http://imtr.ircam.fr/imtr/images/Mubu.knn.maxref.pdf, 2017). ↩︎

  15. Jules Francoise and Norbert Schnell, “Probabilistic Models For Designing Motion And Sound Relationships”, HAL (Archives Ouvertes), 2021, 1-4. ↩︎