Joint Estimation of 3D Hand Position and Gestures from Monocular Video for Mobile Interaction
J. Song, F. Pece, G. Sörös, M. Koelle, O. Hilliges
ACM Human Factors in Computing Systems (CHI 2015)
Seoul, South Korea, Apr. 2015.
We present a machine learning technique to recognize gestures and estimate metric depth of hands for 3D interaction, relying only on monocular RGB video input. We aim to enable spatial interaction with small, body-worn devices where rich 3D input is desired but the usage of conventional depth sensors is prohibitive due to their power consumption and size. We propose a hybrid classification-regression approach to learn and predict a mapping of RGB colors to absolute, metric depth in real time. We also classify distinct hand gestures, allowing for a variety of 3D interactions. We demonstrate our technique with three mobile interaction scenarios and evaluate the method quantitatively and qualitatively.
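The classify-then-regress structure described above can be sketched in a few lines. The toy features, labels, and models below are illustrative stand-ins (a 1-NN classifier and per-class linear fits on synthetic data), not the learned models of the paper:

```python
import numpy as np

# Hypothetical toy data: two hand features per frame, a gesture label, and a
# metric depth in metres. Feature names and relationships are illustrative.
rng = np.random.default_rng(0)
X = rng.random((300, 2))
gesture = (X[:, 0] > 0.5).astype(int)   # two gesture classes
depth = 0.2 + 0.6 * X[:, 1]             # depth driven by one feature

def classify(x):
    # 1-NN gesture classifier (a stand-in for the learned classifier).
    return gesture[np.argmin(np.linalg.norm(X - x, axis=1))]

# One depth regressor per gesture class: classification selects which
# regression model is applied, mirroring the hybrid classify-then-regress idea.
coeffs = {g: np.polyfit(X[gesture == g, 1], depth[gesture == g], 1)
          for g in (0, 1)}

def predict(x):
    g = classify(x)
    return g, np.polyval(coeffs[g], x[1])

g, d = predict(np.array([0.8, 0.5]))
```

The point of the structure is that the predicted gesture class chooses the depth regressor, so each regressor only has to model the feature-to-depth relationship for one hand shape.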
In-air Gestures Around Unmodified Mobile Devices
J. Song, G. Sörös, F. Pece, S. Fanello, S. Izadi, C. Keskin, O. Hilliges
Symposium on User Interface Software and Technology (UIST 2014)
Honolulu, Hawaii, Oct. 6-8, 2014.
We present a novel machine learning based algorithm extending the interaction space around mobile devices. The technique uses only the RGB camera now commonplace on off-the-shelf mobile devices. Our algorithm robustly recognizes a wide range of in-air gestures, supporting user variation and varying lighting conditions. We demonstrate that our algorithm runs in real-time on unmodified mobile devices, including resource-constrained smartphones and smartwatches. Our goal is not to replace the touchscreen as primary input device, but rather to augment and enrich the existing interaction vocabulary using gestures. While touch input works well for many scenarios, we demonstrate numerous interaction tasks such as mode switches, application and task management, menu selection and certain types of navigation, where such input can be either complemented or better served by in-air gestures. This removes screen real-estate issues on small touchscreens, and allows input to be expanded to the 3D space around the device. We present results for recognition accuracy (93% test and 98% train), impact of memory footprint and other model parameters. Finally, we report results from preliminary user evaluations, discuss advantages and limitations and conclude with directions for future work.
Device Effect on Panoramic Video+Context Tasks
F. Pece, J. Tompkin, H.P. Pfister, J. Kautz, C. Theobalt
Conference on Visual Media Production (CVMP 2014)
London, UK, 13-14 November 2014
Panoramic imagery is viewed daily by thousands of people, and
panoramic video imagery is becoming more common. This imagery
is viewed on many different devices with different properties, and
the effect of these differences on spatio-temporal task performance
is as yet untested for this imagery. We adapt a novel panoramic video
interface and conduct a user study to discover whether display type
affects spatio-temporal reasoning task performance across desktop
monitor, tablet, and head-mounted displays. We discover that, in our
complex reasoning task, HMDs are as effective as desktop displays
even though participants felt less capable, but tablets were less effective
than desktop displays even though participants felt just as capable.
Our results impact virtual tourism, telepresence, and surveillance
applications, and so we state the design implications of our results
for panoramic imagery systems.
Video Collections in Panoramic Contexts
J. Tompkin, F. Pece, R. Shah, S. Izadi, J. Kautz and C. Theobalt
Symposium on User Interface Software and Technology (UIST 2013)
St Andrews, UK, Oct. 8-11, 2013.
Video collections of places show contrasts and changes in our world, but current interfaces to video collections make it hard for users to explore these changes.
Recent state-of-the-art interfaces attempt to solve this problem for 'outside->in' collections, but cannot connect 'inside->out' collections of the same place
which do not visually overlap. We extend the focus+context paradigm to create a video-collections+context interface by embedding videos into a panorama.
We build a spatio-temporal index and tools for fast exploration of the space and time of the video collection. We demonstrate the flexibility of our
representation with interfaces for desktop and mobile flat displays, and for a spherical display with joypad and tablet controllers. In a user study,
we evaluate the effect of our video-collections+context system on spatio-temporal localization tasks, and find significant improvements in accuracy and
completion time in visual search tasks compared to existing systems. We measure the usability of our interface with the System Usability Scale (SUS)
and task-specific questionnaires, and find that our system scores higher.
PanoInserts: Mobile Spatial Teleconferencing
F. Pece, W. Steptoe, S. Julier, F. Wanner, T. Weyrich, J. Kautz, and A. Steed
ACM Human Factors in Computing Systems (CHI 2013)
Paris, France, April 27-May 2, 2013.
Awarded a CHI 2013 Honourable Mention
We present PanoInserts: a novel teleconferencing system that uses smartphone cameras to create a surround representation
of meeting places. We take a static panoramic image of a
location into which we insert live videos from smartphones.
We use a combination of marker- and image-based tracking to
position the video inserts within the panorama, and transmit
this representation to a remote viewer. We conduct a user study
comparing our system with fully-panoramic video and conventional
webcam video conferencing for two spatial reasoning
tasks. Results indicate that our system performs comparably
with fully-panoramic video, and better than webcam video
conferencing in tasks that require an accurate surrounding
representation of the remote space. We discuss the representational
properties and usability of varying video presentations,
exploring how they are perceived and how they influence users
when performing spatial reasoning tasks.
Bitmap Movement Detection: HDR for Dynamic Scenes
F. Pece and J. Kautz
Journal of Virtual Reality and Broadcasting
10(2), December 2013, pages 1-13
Extended CVMP 2010 paper
Exposure Fusion and other HDR techniques generate
well-exposed images from a bracketed image sequence
while reproducing a large dynamic range that
far exceeds the dynamic range of a single exposure.
Common to all these techniques is the problem that the
smallest movements in the captured images generate
artefacts (ghosting) that dramatically affect the quality
of the final images. This limits the use of HDR and
Exposure Fusion techniques because common scenes
of interest are usually dynamic. We present a method
that adapts Exposure Fusion, as well as standard HDR
techniques, to allow for dynamic scenes without introducing
artefacts. Our method detects clusters of moving
pixels within a bracketed exposure sequence with
simple binary operations. We show that the proposed
technique is able to deal with a large amount of movement
in the scene and different movement configurations.
The result is a ghost-free and highly detailed
exposure fused image at a low computational cost.
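The binary movement detection can be illustrated with median-threshold bitmaps in the spirit of Ward's alignment bitmaps. This is a simplified sketch of the idea, not the paper's exact pipeline:

```python
import numpy as np

def median_threshold_bitmap(gray):
    # Binarise an image at its median intensity; this is largely invariant
    # to the global brightness change between bracketed exposures.
    return gray > np.median(gray)

def movement_mask(exposures):
    # Pixels whose bitmap value disagrees across the bracketed sequence are
    # flagged as moving; everything else agrees and is treated as static.
    bitmaps = np.stack([median_threshold_bitmap(g) for g in exposures])
    agree_on = np.all(bitmaps, axis=0)
    agree_off = np.all(~bitmaps, axis=0)
    return ~(agree_on | agree_off)

# Toy example: a static gradient with a bright square "moving" in one exposure.
base = np.tile(np.linspace(0, 1, 32), (32, 1))
moved = base.copy()
moved[8:16, 8:16] = 1.0
mask = movement_mask([base, moved])
```

Because each exposure is thresholded at its own median, the comparison needs only cheap binary operations per pixel, which is what keeps the computational cost low.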
Beaming: An Asymmetric Telepresence System
A. Steed, W. Steptoe, W. Oyekoya, F. Pece, T. Weyrich, J. Kautz, D. Friedman, A. Peer, M. Solazzi, F. Tecchia, M. Bergamasco, M. Slater
IEEE Computer Graphics and Applications
32:6, 10-17, 2012.
The Beaming project recreates, virtually, a real environment; using immersive VR, remote participants can visit the virtual model and interact with the people in the real environment. The real environment doesn't need extensive equipment and can be a space such as an office or meeting room, domestic environment, or social space.
Simplified User Interface for Architectural Reconstruction
F. Wanner, F. Pece, J. Kautz
Theory and Practice of Computer Graphics 2012
Rutherford Appleton Laboratory, Didcot, UK, 2012
We present a user-driven reconstruction system for the creation of 3D models of buildings from photographs. The
structural properties of buildings, such as parallel and repeated elements, are exploited to let the user efficiently
create an accurate 3D structure for different building types. An intuitive interface guides the user through the
reconstruction process, which uses a set of input images and a 3D point cloud. The system aims to minimise the
user input by recognising imprecise interaction and ensuring photo consistency.
Acting Rehearsal in Collaborative Multimodal Mixed Reality Environments
W. Steptoe, J.-M. Normand, O. Oyekoya, F. Pece, E. Giannopoulos, F. Tecchia, A. Steed, T. Weyrich, J. Kautz, M. Slater
Presence - Teleoperators and Virtual Environments (2012)
21(4), Fall 2012, pages 406-422
This paper presents experience of using our multimodal mixed reality telecommunication system
to support remote acting rehearsal. The rehearsals involved two actors located in London and
Barcelona, and a director in another location in London. This triadic audiovisual
telecommunication was performed in a spatial and multimodal collaborative mixed reality
environment based on the “destination-visitor” paradigm, which we define and motivate. We detail
our heterogeneous system architecture, which spans the three distributed and
technologically-asymmetric sites, and features a range of capture, display, and transmission
technologies. The actors’ and director’s experiences of rehearsing a scene via the system are then
discussed, exploring the successes and failures of this approach.
Towards Moment Imagery: Automatic Cinemagraphs
J. Tompkin, F. Pece, K. Subr, J. Kautz
Conference on Visual Media Production (CVMP)
London, UK, November 2011
The imagination of the online photographic community
has recently been sparked by so-called cinemagraphs:
short, seamlessly looping animated GIF images created
from video in which only parts of the image move. These
cinemagraphs capture the dynamics of one particular region
in an image for dramatic effect, and provide the creator with
control over what part of a moment to capture. We create
a cinemagraph authoring tool combining video motion
stabilisation, segmentation, interactive motion selection, motion
loop detection and selection, and cinemagraph rendering.
Our work pushes toward the easy and versatile creation
of moments that cannot be represented with still imagery.
Three Depth-Camera Technologies Compared
F. Pece, J. Kautz, T. Weyrich
First BEAMING Workshop
Barcelona, Spain, 14 June 2011
Adapting Standard Video Codecs for Depth Streaming
F. Pece, J. Kautz, T. Weyrich
Proc. of Joint Virtual Reality Conference of EuroVR (JVRC)
Nottingham, UK, September 2011
Cameras that can acquire a continuous stream of depth images are now commonly available, for instance the
Microsoft Kinect. It may seem that one should be able to stream these depth videos using standard video codecs,
such as VP8 or H.264. However, the quality degrades considerably as the compression algorithms are geared
towards standard three-channel (8-bit) colour video, whereas depth videos are single-channel but have a higher
bit depth. We present a novel encoding scheme that efficiently converts the single-channel depth images to standard
8-bit three-channel images, which can then be streamed using standard codecs. Our encoding scheme ensures that
the compression affects the depth values as little as possible. We show results obtained using two common video
encoders (VP8 and H.264) as well as the results obtained when using JPEG compression. The results indicate that
our encoding scheme performs much better than simpler methods.
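The packing idea can be illustrated with a naive high-byte/low-byte split of 16-bit depth into 8-bit channels. This is only a stand-in for the paper's encoding, which is designed so that lossy compression perturbs the decoded depth as little as possible; the sketch below shows just the lossless round trip:

```python
import numpy as np

def encode_depth(depth16):
    # Pack single-channel 16-bit depth into a 3-channel 8-bit image
    # (naive byte split; the third channel is unused here).
    high = (depth16 >> 8).astype(np.uint8)
    low = (depth16 & 0xFF).astype(np.uint8)
    zero = np.zeros_like(high)
    return np.stack([high, low, zero], axis=-1)

def decode_depth(rgb8):
    # Recover the 16-bit depth from the two used channels.
    high = rgb8[..., 0].astype(np.uint16)
    low = rgb8[..., 1].astype(np.uint16)
    return (high << 8) | low

depth = np.random.randint(0, 2**16, size=(4, 4), dtype=np.uint16)
roundtrip = decode_depth(encode_depth(depth))
```

A hard byte split like this is fragile under lossy codecs: a one-unit compression error in the high channel decodes to a 256-unit depth error, which is exactly why a compression-tolerant mapping is needed in practice.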
Patents and Others