EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

Anonymous Submission

Abstract

Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing typically requires intrusive hardware. Existing vision-based methods are largely restricted to planar surfaces or fingertip contacts and fail to generalize to complex 3D object interactions. To address this, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, including a bare-hand transfer subset that enables generalization to natural scenarios. On this benchmark, we first establish EgoPressureFormer as a discriminative baseline. To explicitly model the uncertainty inherent in partial observations, we further propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world-knowledge priors with a Physically-Informed Feature Rectification layer that injects semantic constraints, our approach hallucinates plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and transfers robustly to in-the-wild scenarios.


Figure 1: Task overview. Given an input RGB clip, the model predicts a pressure sequence or heatmap, optionally leveraging auxiliary Condition Info (e.g., masks and text metadata) to resolve physical ambiguities. The output is represented in two inter-convertible formats: a sparse sensor sequence and a dense spatial heatmap.

A. EgoTactile Benchmark

EgoTactile benchmarks egocentric full-hand grasp pressure prediction by synchronizing wearable RGB video with high-resolution tactile measurements. The capture system records 1280×720 egocentric video at 30 fps (head- and neck-mounted viewpoints) together with a 162-sensor pressure glove (0–350 N, ~17 Hz), collected in a controlled green-screen setup with randomized lighting and object poses to improve robustness. To support both direct supervision and real-world transfer, the benchmark contains (i) a gloved-hand set where the pressure glove is visible and paired with synchronized pressure labels, and (ii) a bare-hand set where the visible hand is bare while an off-camera gloved hand performs a synchronized grasp to provide labels, guided by a metronome with varied tempo to avoid shortcut learning. The dataset spans 63 everyday objects across 7 categories and 12 participants (balanced gender), and includes anonymized object/subject metadata (e.g., weight/material/fill state; age/body weight/body fat/hand length) for controlled analysis and optional conditioning. Benchmark protocols evaluate both object generalization (Object-Held-Out) and cross-subject robustness (Subject-Held-Out) on the gloved-hand set, and Object-Held-Out transfer on the bare-hand set.
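The task overview in Figure 1 notes that the output pressure is represented in two inter-convertible formats: a sparse 162-sensor sequence and a dense spatial heatmap. A minimal sketch of that conversion, assuming a hypothetical sensor-to-pixel layout (the real layout is fixed by the glove hardware, and a random assignment stands in for it here):

```python
import numpy as np

# Hypothetical sensor-to-pixel layout: each of the 162 glove sensors is
# assigned a unique (row, col) location on a coarse 32x32 hand-template
# grid. The benchmark's actual layout is defined by the glove hardware;
# this random assignment is for illustration only.
rng = np.random.default_rng(0)
N_SENSORS, H, W = 162, 32, 32
layout = rng.choice(H * W, size=N_SENSORS, replace=False)  # no collisions
rows, cols = np.unravel_index(layout, (H, W))

def sensors_to_heatmap(pressure):
    """Scatter a sparse (162,) pressure reading into a dense (H, W) map."""
    heatmap = np.zeros((H, W), dtype=np.float32)
    heatmap[rows, cols] = pressure
    return heatmap

def heatmap_to_sensors(heatmap):
    """Sample the dense map back at the sensor locations."""
    return heatmap[rows, cols]

pressure = rng.uniform(0.0, 350.0, size=N_SENSORS).astype(np.float32)  # Newtons
heatmap = sensors_to_heatmap(pressure)
recovered = heatmap_to_sensors(heatmap)
```

Because each sensor maps to a unique grid cell, the round trip is lossless; a real template additionally interpolates between sensor sites to produce a smooth heatmap.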



Figure 2: Data collection setup and dataset statistics. Left: Our capture environment features controlled lighting and a green-screen background (a), and (b) illustrates the data collection scenario for the bare-hand setting. To ensure viewpoint diversity and realistic transfer, we capture data using both head-mounted (c) and neck-mounted (d) cameras. Right: Statistics of the collected data, including hand contact probability (e), force magnitude distributions across objects (f), average pressure heatmaps (g), and force magnitude distributions across participants (h).


B. Baselines

Baseline I: EgoPressureFormer

As a discriminative video-to-pressure sequence predictor, EgoPressureFormer builds on a TimeSformer backbone and avoids dense pixel-aligned heatmap regression under occlusion by using a query-based decoder: learnable sensor embeddings attend to spatiotemporal visual tokens to directly predict the full M-sensor pressure sequence. Training further mitigates contact sparsity via a multi-task objective combining per-sensor classification with a frame-level contact gate. However, as a deterministic regressor, it struggles to model the multi-modal distribution of pressure under severe occlusion.
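The query-based decoder can be sketched as follows, with learnable per-sensor queries cross-attending to the backbone's visual tokens and a linear head regressing one pressure value per sensor. The single-head attention and all shapes are illustrative assumptions; the actual model uses a TimeSformer backbone and a multi-layer transformer decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N_TOKENS, D = 162, 196, 64            # sensors, visual tokens, feature dim

queries = rng.normal(size=(M, D))        # learnable sensor embeddings
tokens = rng.normal(size=(N_TOKENS, D))  # spatiotemporal visual tokens
w_out = rng.normal(size=(D, 1))          # per-sensor regression head

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Single-head cross-attention: each sensor query pools the visual tokens
# it finds relevant, then is projected to a scalar pressure value.
attn = softmax(queries @ tokens.T / np.sqrt(D))   # (M, N_TOKENS)
pooled = attn @ tokens                            # (M, D)
pressure = (pooled @ w_out).squeeze(-1)           # (M,) one value per sensor
```

Because the queries are decoupled from pixel locations, the decoder can still emit a prediction for every sensor even when parts of the hand are occluded in the image.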

Baseline II: EgoPressureDiff

We cast egocentric pressure estimation as conditional latent diffusion to handle the inherent multi-modality caused by severe occlusion: multiple plausible pressure patterns can explain the same visible pixels, making deterministic regression ill-posed. EgoPressureDiff adapts a pre-trained Stable Video Diffusion (SVD) backbone to synthesize pressure heatmap sequences by denoising pressure latents conditioned on the RGB video, with both RGB clips and pressure heatmaps encoded into a shared latent space via SVD’s VAE and fused by channel concatenation in the denoiser. As summarized in Figure 3 (EgoPressureDiff Training Pipeline), the model further incorporates three complementary guidance signals to disambiguate contact: (i) hint masks as a structured spatial prior (processed by a lightweight Mask Encoder and injected into the latent input), (ii) text prompts encoding object/subject physical attributes, and (iii) a pressure-heatmap prototype providing an anatomical topology prior of where pressure can plausibly appear.
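The conditioning pathway can be sketched under simplified assumptions: once the noisy pressure latent, the VAE-encoded RGB latent, and the Mask Encoder output share the same spatial resolution, fusion is plain channel concatenation before the denoiser. Channel counts below are illustrative, not SVD's actual ones:

```python
import numpy as np

rng = np.random.default_rng(0)
C_LAT, h, w = 4, 8, 8                              # illustrative latent shape

noisy_pressure_latent = rng.normal(size=(C_LAT, h, w))  # diffusion target
rgb_latent = rng.normal(size=(C_LAT, h, w))             # from SVD's VAE
mask_feat = rng.normal(size=(1, h, w))                  # from the Mask Encoder

# Fuse all conditioning signals along the channel axis; the denoiser's
# first convolution is widened accordingly to accept the extra channels.
denoiser_input = np.concatenate(
    [noisy_pressure_latent, rgb_latent, mask_feat], axis=0
)  # (2 * C_LAT + 1, h, w)
```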


Figure 3: EgoPressureDiff Training Pipeline. We formulate pressure estimation as a latent diffusion process conditioned on egocentric RGB video. To resolve physical ambiguities, we incorporate multi-modal guidance via: (i) a hint mask processed by a trainable Mask Encoder, and (ii) text prompts and a prototype heatmap injected through the proposed PIFR Layer.

Physically-Informed Feature Rectification (PIFR) layer

A key challenge is that naive fusion of text (physically informative but spatially coarse) and prototypes (topologically stable but weak on magnitude) can yield unstable or physically implausible intensities, with the denoiser over-relying on the prototype. To address this, EgoPressureDiff introduces the Physically-Informed Feature Rectification (PIFR) module, which uses text-conditioned features to rectify the prototype-conditioned representation before it conditions SVD. Concretely (Figure 4), an intermediate denoiser feature map attends to prototype tokens and text tokens to form prototype-conditioned and text-conditioned features; a lightweight mapping network predicts affine modulation parameters γ and β from the text-conditioned feature, and applies them to calibrate the prototype-conditioned feature, preserving anatomical plausibility while injecting magnitude constraints implied by physical attributes.
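The rectification step above amounts to a FiLM-style per-channel affine modulation. A minimal sketch, assuming pooled conditioning features and a one-layer mapping network (the real layer operates on denoiser feature maps inside the SVD U-Net, and all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16

proto_feat = rng.normal(size=(C, H, W))   # prototype-conditioned feature
text_feat = rng.normal(size=(C,))         # pooled text-conditioned feature

# "Mapping network": one linear layer per parameter for this sketch.
w_gamma, b_gamma = rng.normal(size=(C, C)), np.zeros(C)
w_beta, b_beta = rng.normal(size=(C, C)), np.zeros(C)
gamma = w_gamma @ text_feat + b_gamma     # per-channel scale
beta = w_beta @ text_feat + b_beta        # per-channel shift

# Rectify: the prototype feature keeps its spatial (anatomical) structure,
# while text-derived gamma/beta calibrate its intensities.
rectified = gamma[:, None, None] * proto_feat + beta[:, None, None]
```

The design keeps the two priors in their complementary roles: the prototype fixes where pressure can appear, and the text-conditioned affine parameters only rescale and shift how strong it is.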


Figure 4: (a) The original U-Net block of SVD. (b) Our proposed PIFR layer integrated into the U-Net block. Here, γ and β denote the scale and shift factor, respectively.

C. Comparisons with Baselines

Below we present qualitative hand-pressure predictions produced by the baselines proposed in this paper on the same video clip. PressureVision and EgoPressureFormer output discrete pressure classes; to visualize them as continuous heatmaps, we map each predicted class to its corresponding mean pressure value and render the resulting per-location pressure field as a heatmap.
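The class-to-heatmap visualization can be sketched as a simple lookup. The bin edges below are hypothetical; the actual discretization follows the benchmark's pressure range:

```python
import numpy as np

# Hypothetical pressure bins in Newtons; each discrete class is rendered
# as the mean pressure of its bin.
bin_edges = np.array([0.0, 1.0, 5.0, 20.0, 80.0, 350.0])
class_to_mean = 0.5 * (bin_edges[:-1] + bin_edges[1:])  # (5,) bin means

# Per-location class predictions (toy 2x2 field) -> continuous values.
pred_classes = np.array([[0, 2], [4, 1]])
pressure_field = class_to_mean[pred_classes]  # ready to render as a heatmap
```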

D. Bare Hand Setting Results

The videos below showcase pressure generation by EgoPressureDiff in the bare-hand setting, demonstrating its ability to synthesize temporally coherent, anatomically plausible hand-pressure patterns conditioned on egocentric observations when the pressure glove is not visible.

Ground Truth
Input Video
EgoPressureDiff
Ground Truth
Input Video
EgoPressureDiff

E. In The Wild Results

The videos below present pressure generation results from EgoPressureDiff in an in-the-wild, uncontrolled environment, illustrating its robustness to real-world variations in illumination, background clutter, and hand–object interaction dynamics while maintaining coherent, physically plausible pressure patterns.

F. In The Wild Results on Unseen Objects

The video below shows EgoPressureDiff generating hand-pressure patterns for previously unseen objects in an uncontrolled, in-the-wild setting (an egg, a nutcracker figurine, and jasmine green tea). Notably, the model remains robust under this out-of-distribution object shift, producing temporally consistent and anatomically plausible pressure maps that qualitatively align with the expected grasp strategies induced by each object’s geometry and handling requirements.

G. Results on Unseen Grasp and Interaction Patterns

The videos below demonstrate the performance of EgoPressureDiff on unseen grasp and interaction patterns. While EgoPressureDiff still produces coarse pressure predictions for grasping actions during the interaction, it struggles with entirely unfamiliar interaction modes. In future work, we plan to incorporate hand pose and object 6D pose information into the modeling process, which should further improve pressure prediction on unseen patterns.

Ground Truth
Input Video
EgoPressureDiff
Ground Truth
Input Video
EgoPressureDiff

Acknowledgements

The website template was borrowed from Michaël Gharbi and Mip-NeRF.