Main

Robots have the potential to perform useful physical tasks in a wide range of application areas1,2,3,4. To robustly manipulate objects in complex and changing environments, a robot must be able to perceive when, where and how its body is contacting other things. Although widely studied and highly successful for environment perception at a distance, centrally mounted cameras and computer vision are poorly suited to real-world robot contact perception due to occlusion and the small scale of the deformations involved. Robots instead need touch-sensitive skin, but few haptic sensors exist that are suitable for practical applications.

Recent developments have shown that machine-learning-based approaches are especially promising for creating dexterous robots2,5,6. In such self-learning scenarios and real-world applications, the need for extensive data makes it particularly critical that sensors are robust and keep providing good readings over thousands of hours of rough interaction. Importantly, machine learning also opens new possibilities for tackling this haptic sensing challenge by replacing handcrafted numeric calibration procedures with end-to-end mappings learned from data7.

Many researchers have created haptic sensors8 that can quantify contact across a robot’s surfaces: previous successful designs produced measurements using resistive9,10,11,12,13, capacitive14,15,16, ferroelectric17, triboelectric18 and optoresistive19,20 transduction approaches. More recently, vision-based haptic sensors21,22,23,24,25,26 have demonstrated a new family of solutions, typically using an internal camera that views the soft contact surface from within; however, these existing sensors tend to be fragile, bulky, insensitive, inaccurate and/or expensive. By considering the goals and constraints from a fresh perspective, we have invented a vision-based sensor that overcomes these challenges and is thus suitable for robotic dexterous manipulation.

Table 1 provides a detailed comparison of representative state-of-the-art sensors. We highlight the most important differences and refer the reader to the Methods for a more thorough examination. The mechanical designs of all previous sensors employ multiple functional layers, which are complex to fabricate and can be delicate. Insight is the only sensor with a single soft layer. Many tasks benefit from a large three-dimensional sensing surface rather than small two-dimensional sensing patches; however, only a few other sensors offer three-dimensional surfaces25,27,28,29. Some of them require special lenses25 or use multiple cameras27, whereas others are more fragile28,29. Insight needs only a single camera and simple manufacturing techniques. Depending on their mechanical design, sensors also have widely varying sensing surface area and sensor volume. We provide area per volume (A/V) in Table 1 as a measure of compactness and find that Insight is among the most compact vision-based sensors with the largest sensing surface.

Table 1 A comparison between state-of-the-art haptic sensors and our design (Insight, highlighted in bold). An upward (downward) arrow indicates that higher (lower) values are better

Most existing sensors provide only localization of a single contact20,25,27,28,30; some also provide a force magnitude9,23,31 without force direction. Others are specialized for measuring contact area shape21,29,32. Although real contacts will be multiple and complex, a spatially extended map of three-dimensional contact forces over the surface, which we call a force map, is only rarely provided (for example, ref. 22). Insight is the only sensor that provides a force map across a three-dimensional surface such that a robot can have detailed directional information about simultaneous contacts. Many sensors rely on analytical data processing22,25,28,33, which requires careful calibration; it is difficult to obtain correct force amplitudes with such an approach as materials are often inhomogeneous and the assumption of linearity between deformation and force is often violated. Data-driven approaches such as those used with a BioTac9, GelSight21, OmniTact27 and Insight can deal with these problems but require copious quality data.

This paper presents a new soft thumb-sized sensor with all-round force-sensing capabilities enabled by vision and machine learning; it is durable, compact, sensitive, accurate and affordable (less than $100). As it consists of a flexible shell around a vision sensor, we name it Insight. Although initially designed for dexterous manipulation and behavioural learning, our sensor is suitable for many other applications and our technology can be adapted to create a variety of three-dimensional haptic sensing systems.

Figure 1 shows the principles behind the design of Insight. The skin is made of a soft elastomer over-moulded34 on a hollow stiff skeleton to maintain the sensor’s shape and allow for high interaction forces without damage (Fig. 1b). It utilizes shading effects35 and structured light36 to monitor the three-dimensional deformation of the sensing surface with a single camera from the inside (Fig. 1c). The sensor’s output is computed by a data-driven machine-learning approach10,12, which directly infers distributed contact-force information from raw camera readings, avoiding complicated calibration or any handcrafted post-processing steps (Fig. 1d).

Fig. 1: The design principles of Insight.

a, The overall structure of the sensor with its hybrid mechanical construction and internal imaging system. For comparison, the sensor is shown in a human hand next to the corresponding camera view. b, The pure elastomer (left), the stiff hollow skeleton (middle) and both over-moulded together (right). ci, The internal lighting using a translucent shell: the LED ring with apertures creates light cones, visualized by their projections on flat horizontal planes. cii, The light projection patterns within, as seen by the camera in the undeformed opaque shell. d, The data processing pipeline. The machine-learning model is trained on data collected by an automatic test bed; T and R stand for translation and rotational test-bed movements, respectively. Each data point combines one image from the camera with the indenter’s contact position and orientation, contact force vector and diameter, which are used to calculate a ground-truth force distribution map from an approximate model under consistent contact forces.

We evaluate Insight against several rigorous performance criteria. When indented by a hemispherical tip with a diameter of 4 mm at a force amplitude of up to 2 N, the sensor can achieve an average localization accuracy and force accuracy of around 0.4 mm and 0.03 N, respectively. By directly estimating both the normal and shear components of each applied force vector, the sensor reaches an average directional estimation error of around 5°. Moreover, in the absence of contact, Insight is sensitive enough to recognize its posture relative to gravity based only on the deformations caused by its own self-weight, which are not detectable by the human eye.

Principles of operation and design

At the core of our design is a single camera that observes the sensor’s opaque over-moulded elastic shell from the inside (Fig. 1a). Photometric effects and structured lighting enable it to detect the tiny deformations of the sensor surface that are caused by physical contact. Instead of computing the contact force vectors numerically from the observed deformations using elastic theory33,37, which relies on limiting assumptions, we use a machine-learning approach that translates images directly to force distribution maps. The details are shown in Fig. 1 and explained below.

Mechanics

We aim for a compliant and sensitive sensing surface because soft materials have favourable properties for manipulating objects38, enable safer interactions around humans39, and limit the instantaneous impact forces during unforeseen collisions40. Nevertheless, soft materials alone cannot withstand large interaction forces and are deformed by gravity and inertial effects41.

To ensure a compliant sensing surface, high contact sensitivity and robustness against self-motion, we design a soft-stiff hybrid structure using over-moulding (Fig. 1b)34. The structure is composed of two parts: a flexible elastomer to sense contact and an aluminium skeleton to support the sensing surface. The resulting sensor is not only sufficiently structurally stable to keep its overall shape under high contact forces, but it is also sensitive enough for gentle interaction forces to cause local deformations. In contrast to our approach, other successful curved vision-based sensors, such as GelTip28 and that of Romero and colleagues29, solve the stability problem with a smooth and uniform support structure made of transparent glass, acrylic or resin29, which allows for good imaging quality and acts as a light guide. Our metal skeleton can be designed independently of the lighting.

Insight’s shell is hollow so that the entire system is lightweight; avoiding direct contact with any optical elements (by contrast to ref. 27) also reduces the chances of image distortion and system damage. Constructing a single elastomer layer that serves all purposes is a simple, compact, robust and wireless solution for haptic sensing. All other vision-based haptic sensors are built from multiple layers of different materials (for example, protective coatings, marker sheets, elastomers, adhesive sheets). Durability issues often arise due to non-permanent attachment between layers (for example, refs. 21,27,28,42). Another design consideration is the opaqueness of the elastomer. Our elastomer layer is opaque enough to block all interference from ambient light, ensuring reliable output even under bright lighting conditions. Sensors that have a thin opaque coating on top of transparent elastomer may struggle to achieve this property and/or maintain it over long-term use. To demonstrate the sensitivity of our approach, we include a thin, flat area of elastomer near the sensor’s end for higher-resolution perception of detailed shapes (akin to a tactile fovea).

Imaging

Two main techniques can be used to obtain three-dimensional information from a single camera. Photometric stereo35 uses multiple images of the same scene taken under different illumination directions to infer the three-dimensional shape from shading information. Structured light36 is a single-shot three-dimensional surface-reconstruction technique that uses a unique light pattern and the fact that its appearance depends on the shape of the three-dimensional surface on which it is projected. Photometric stereo is generally better at capturing local details, whereas structured light is used for coarser global reconstruction43,44. Photometric stereo is most effective when the illumination is nearly parallel to the surface, as the normal vectors of the deformed surface can then be finely reconstructed from shading information21. Sensors built on photometric stereo21,28,42 employ light guides to create this desirable tangential surface illumination, which is challenging for highly curved sensing surfaces28,29. Structured light allows for more perpendicular lighting of the surface and improves with larger disparity between the light source and camera.

Insight is unique among haptic sensors as it combines photometric stereo and structured light to detect the deformation of a full three-dimensional cone-shaped surface in the single-camera single-image setting. Light-emitting diode (LED) sources around the camera produce distinct light cones (eight in our prototype, as shown in Fig. 1c). The lighting direction is adjusted through a collimator to introduce a suitable structured light pattern that favours locally parallel lighting for photometric stereo, as depicted in Extended Data Fig. 1c. In contrast to the light guides of other sensors21,28,29,42, our collimator allows for flexible lighting and is independent of the support structure. When an area of the sensor surface is contacted from the outside, the surface orientation changes, which causes a difference in colour intensity through shading. The surface displacement also changes the distance from the surface to the camera, which the structured light cones reveal as a per-pixel colour change.

Haptic information

Sensors can capture many types of haptic information such as vibration45, deformation12,21,46, undirected pressure distribution10 and directional force distribution22,33. For robotics applications, a directional force distribution is the preferred form of contact information, as it describes the location and size of each contact region, as well as the local loading in the normal and shear directions47. Our proposed sensor is designed to deliver precisely this type of contact information, that is, a three-dimensional directional force distribution over a three-dimensional conical sensing surface represented by a fine mesh of points, where each point has three mutually orthogonal force elements.

In a classical estimation chain, the force distribution is inferred from the surface displacement using a linear stiffness matrix based on elastic theory37. This approach is poorly suited for our design (as discussed in the Methods), so we employ a data-driven method to estimate the force distribution directly from the raw image input using machine-learning techniques, namely an adapted ResNet48, a widely used deep convolutional neural network (CNN) architecture. To collect reference data to train the neural network, we built a position-controlled five-degrees-of-freedom (DOF) test bed with an indenter that probes the designed sensor. A six-DOF force-torque sensor (ATI Mini40) measures the force vector applied to the indenter so we can simultaneously record ground-truth forces and corresponding images from the camera inside the sensor. The target force distribution map corresponding to each contact is computed by a simple spatial approximation using the known force vector, contact location, and indenter diameter. A subset of all data is used to train the machine-learning model. The entire process is illustrated in Fig. 1d and detailed in the Methods.
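
As an illustration, the following minimal Python/NumPy sketch shows one way such a target map could be assembled from the measured force vector, contact location and indenter diameter; the linear distance falloff is an assumption made here for clarity and is not necessarily the approximation that was ultimately selected (see Supplementary Section A.3).

```python
import numpy as np

def approximate_force_map(mesh_points, contact_point, force_vector, indenter_diameter):
    """Spread a measured contact force over nearby surface mesh nodes.

    Illustrative candidate only: nodal weights fall off linearly with distance
    from the contact and the weighted nodal forces sum to the measured force.
    mesh_points: (N, 3) node coordinates in the sensor frame.
    contact_point: (3,) contact location from the test bed.
    force_vector: (3,) force measured by the force/torque sensor.
    """
    radius = indenter_diameter / 2.0
    dist = np.linalg.norm(mesh_points - contact_point, axis=1)
    weights = np.clip(1.0 - dist / radius, 0.0, None)   # zero outside the contact patch
    if weights.sum() == 0.0:                            # contact misses the mesh entirely
        return np.zeros((len(mesh_points), 3))
    weights /= weights.sum()                            # normalize so nodal forces sum to F
    return weights[:, None] * np.asarray(force_vector)  # (N, 3) target force map
```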

Fabrication

As depicted in Fig. 2, fabrication of Insight includes three main aspects: the imaging system, mechanical components and optical properties. An explanation of the design choices and further details of the fabrication process can be found in the Methods.

Fig. 2: The fabrication process of Insight.

a, An exploded view of Insight with all parts in the design; bold items were custom-fabricated. b, The materials, processing steps and intermediate outcomes for the elastomer, the skeleton frame, the moulds for over-moulding, the collimator and the over-moulded sensing surface. c, The partially assembled Insight and an image captured under no contact.

Performance

The sensor’s performance is evaluated with respect to both accuracy and sensitivity. The first measure of accuracy is direct single-contact estimation: a contact force needs to be localized, and its magnitude and direction must be inferred. Second is force distribution estimation for single contact: the contact area and directional force distribution over the entire sensing surface are inferred. We also provide qualitative results for multiple contacts. Finally, we evaluate Insight’s sensitivity by studying whether it can perceive gravitational effects and characterizing its ability to detect shapes contacting the tactile fovea.

Accuracy of direct contact estimation

Our primary way to assess accuracy is to quantitatively evaluate the system’s ability to localize contacts and measure the applied force. First, we use a hemispherically tipped indenter with a diameter of 4 mm to probe a large number of points distributed across Insight’s sensing surface (Fig. 3ai). In this procedure, we collect the images under contact and the contact force vectors from the force sensor, as well as the position of each contact on the sensor’s surface using our five-DOF test bed (Fig. 1d). The histogram of the applied forces in Fig. 3aiii shows that most contacts have magnitudes smaller than 1.6 N, as we set this value as the data-collection threshold to avoid damaging the sensor. We then train a machine-learning model (modified ResNet48 structure) to infer the contact information. The inputs to the model are the image under contact, a static reference image without contact, and a static image of the stiff skeleton (recorded before over-moulding in a dark environment) to encode the inhomogeneous elasticity. The outputs are the three-dimensional coordinates of the contact in the sensor’s reference frame and the three-dimensional force components expressed in the local surface coordinate frame, as depicted in Fig. 3aii. Details on data collection and machine learning are summarized in the Methods.
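
A minimal PyTorch sketch of such a regressor is given below; it stacks the three RGB images into a nine-channel input and adapts an off-the-shelf ResNet-18 backbone. The layer configuration is illustrative and differs from the exact architecture used for Insight (four ResNet blocks, see Methods).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class DirectContactNet(nn.Module):
    """Sketch of a ResNet-based regressor: three stacked RGB images in,
    contact position (Px, Py, Pz) and force (Fs1, Fs2, Fn) out."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Accept nine channels: contact image + reference image + skeleton image.
        backbone.conv1 = nn.Conv2d(9, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 6)
        self.backbone = backbone

    def forward(self, contact_img, reference_img, skeleton_img):
        x = torch.cat([contact_img, reference_img, skeleton_img], dim=1)  # (B, 9, H, W)
        return self.backbone(x)  # (B, 6): [Px, Py, Pz, Fs1, Fs2, Fn]
```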

Fig. 3: Single contact performance with direct estimation.

a, The estimation pipeline for inferring single contact position and force. ai, The real experimental set-up in which the test bed probes Insight and collects data. aii, The machine-learning model: the inputs are three images (raw image, reference image and skeleton image), and the outputs are the contact location (Px, Py and Pz denote the coordinates of the contact in the sensor’s reference frame) and contact force vector (Fs1, Fs2 and Fn). aiii, A histogram of the forces applied in the data collection procedure. b, Statistical evaluation of the sensor’s performance on the test data. bi, The localization and force estimation performance grouped by applied force magnitude. The red-, green- and blue-coloured half-violins show the distribution of deviations in the x, y and z directions, respectively. The force is predicted relative to the surface in normal direction Fn and two shear directions Fs1 and Fs2. The orange half-violins stand for the resulting total errors. bii, The spatial distribution of the localization and force quantification errors for the same test data.

We evaluate the single-contact direct estimation accuracy of localization and force sensing for an applied force magnitude up to 2 N, as shown in Fig. 3bi. All reported numbers are for test contact points that do not appear in the training data. The overall median localization precision is around 0.4 mm, and the force magnitude precision is approximately 0.03 N in the normal and shear directions. The force direction is estimated with a precision of approximately 5°. Notice that the test bed has an overall position precision of 0.2 mm, and the force-torque sensor has a force precision of 0.01/0.01/0.02 N (Fx/Fy/Fz). Insight’s accuracy in localization is remarkably stable over different force ranges, whereas the error in force amplitude slightly increases with higher interaction force. For strong applied forces (over 1.6 N), the force accuracy becomes worse, presumably as we have insufficient training data for this domain (histogram in Fig. 3aiii). Another explanation is that high forces occur most often at locations near the stiff frame (Fig. 3bii), which deforms only a little. There is no noticeable difference in the localization and force accuracy in the sensor frame’s x, y and z directions.

We specifically evaluate Insight’s accuracy at localizing test contact points, as shown in Fig. 3bii. The accuracy is largely stable across the surface, with somewhat higher errors near the stiff frame. Only areas near the camera show a systematic performance drop; as our camera has a 4:3 aspect ratio, it cannot see two opposite areas at the base of the shell, below the lowest ring of the stiff frame.

Accuracy of force map estimation

To infer contact areas and multiple simultaneous contacts, we now consider the distribution of contact force vectors across the entire surface, which we call a whole-surface force map. Altogether, the force map yields valuable information for robotic grasping and manipulation, for example, for slip detection, in-hand object movement and haptic object recognition.

Insight has a three-dimensional curved surface and thus needs to output a force map with the same shape. We create a fine mesh of 3,800 points spanning the entire surface with an average spacing of 1 mm. Each point has three output values describing the force components it feels in the x, y and z directions expressed in the reference frame of the sensor.

Similar to the direct contact estimation, we also employ a machine-learning-driven pipeline. Instead of the six-dimensional output (Fig. 3aii), the network now produces the approximate force distribution map (Fig. 4ai) using only convolutional layers. The map is estimated as a flat image with three channels (Fx, Fy, Fz) to describe nodal forces (the individual force on each point) in the x, y and z directions, respectively, mimicking the red, green and blue channels in a colour image. Each pixel in the image corresponds to one point in the force map. The correspondence is established using the Hungarian assignment method49, which minimizes the overall distance between pixels and points projected to the two-dimensional camera image, as shown in Fig. 4aii. Training the machine-learning model from collected data also requires target force distribution maps (Fig. 1d). As they are not measured directly, we approximate the force map applied by the indenter by distributing the measured total force locally across the surface. From a set of five diverse candidates, as detailed in Supplementary Section A.3, the approximation yielding the best performance in localization and force magnitude accuracy is selected (see Extended Data Fig. 2 and Supplementary Table 5).
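
Assuming the mesh nodes have already been projected into the camera image by an upstream camera model, the assignment step can be sketched with SciPy’s Hungarian solver as follows.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def assign_pixels_to_points(pixel_coords, projected_points):
    """One-to-one pixel-to-node correspondence via the Hungarian method.

    pixel_coords: (N, 2) image coordinates of the N output-map pixels.
    projected_points: (N, 2) mesh nodes projected into the camera image.
    Returns an index array such that pixel i represents node assignment[i].
    """
    cost = cdist(pixel_coords, projected_points)    # pairwise pixel-node distances
    row_ind, col_ind = linear_sum_assignment(cost)  # minimize the total distance
    assignment = np.empty(len(pixel_coords), dtype=int)
    assignment[row_ind] = col_ind
    return assignment
```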

Fig. 4: Performance evaluation of the force map.

a, The pipeline of estimating the force distribution; the ResNet network transforms three images (raw, reference, skeleton) into the xyz force map image (ai), and its pixels are mapped to points on the sensing surface (aii). b, The quantitative evaluation of the performance for force amplitude, force direction and contact area size inference grouped by applied force amplitude. c, The data flow and estimated force map when the sensor is pinched and rotated by two fingers.

The quantitative estimation accuracy for the force amplitude and direction is reported in Fig. 4b, grouped by force magnitude. The evaluation is based on the comparison between the three-dimensional force vectors summed across the predicted force map and the ground-truth force vectors using the same single-contact dataset. The median error in inferring the total force is around 0.08 N, and the error grows with increasing force (Fig. 4bi and Supplementary Fig. 4). The system’s tendency to slightly underestimate larger forces is probably caused by our force map approximation method, the influence of the skeleton, and the machine-learning method itself, which tends to estimate smooth force distributions rather than peaked maps. An ablation of the skeleton image as input leads to worse underestimation (Supplementary Section A.6), supporting part of this hypothesis. The median error in inferring the force direction is around 10° for low contact forces, and it decreases to 5° with higher applied forces (Fig. 4bii). Moreover, we can also localize the contact with a precision of around 0.6 mm based on the force map by averaging the locations of the 20 points with the highest force amplitudes (Supplementary Fig. 4). Supplementary Section A.6 analyses how this performance depends on the amount of data and the type of input provided to the network.

The contact area is estimated by identifying the points with predicted forces larger than 0.02 N. The diameter of this contact area increases with higher applied force and is overestimated by about 1 mm for a 4 mm indenter at high forces. Insight possesses a nail-shaped zone with a thinner elastomer layer (1.2 mm) and a sensing area of 13 × 11 mm², as indicated in Fig. 5b. The median position and force errors in this tactile fovea are 0.3 mm and 0.026 N over an applied force range of 0.03–0.8 N, showing better position and force accuracy than the other sensing areas.
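
The quantities above (contact area from a 0.02 N threshold, contact location from the 20 strongest points, total force as the sum over the map) can be computed from a predicted force map with a few lines of NumPy; the sketch below is illustrative and assumes the map is given as per-node force vectors.

```python
import numpy as np

def summarize_force_map(node_positions, node_forces, area_threshold=0.02, top_k=20):
    """Single-contact summaries derived from a predicted force map (sketch).

    node_positions: (N, 3) mesh node coordinates; node_forces: (N, 3) nodal forces.
    The 0.02 N threshold and the 20-point average follow the values in the text.
    """
    magnitudes = np.linalg.norm(node_forces, axis=1)

    # Contact area: all nodes whose predicted force exceeds the threshold.
    contact_nodes = node_positions[magnitudes > area_threshold]

    # Contact location: mean position of the top_k nodes with the largest forces.
    strongest = np.argsort(magnitudes)[-top_k:]
    contact_location = node_positions[strongest].mean(axis=0)

    # Total force: sum of the nodal force vectors across the whole map.
    total_force = node_forces.sum(axis=0)
    return contact_nodes, contact_location, total_force
```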

Fig. 5: Evaluation of Insight’s sensitivity to self-posture and detailed shapes.

a, Quantitative evaluation of the sensor posture recognition. ai,ii, The experiment set-up (i) and the inference procedure for posture (roll and yaw angles) (ii). The network maps the difference between the current image and the reference image to posture coordinates. aiii, The pixel-wise root-mean-square error (RMSE) of the image difference as the sensor is rotated to all possible roll and yaw angles. aiv, A statistical summary of the posture estimation accuracy: the yaw angle estimation performance with a yellow fitted curve (left) and the roll angle estimation accuracy under different yaw angles (right). b, A qualitative evaluation of the tactile fovea for shape detection. bi, The experimental set-up for applying differently shaped probes to the high-sensitivity region. bii, The system’s perceptual limits for sharpness (v-shaped wedge with an included angle of 150°) and number of edges (nine-sided polygon) with an indenter diameter of 6 mm, along with their corresponding captured raw images and respective red, green and blue channels. biii,iv, The sharpness and edge tests: included angle and edge count (upper), indenter samples (middle) and captured images under indentation tests (lower).

We use an indenter with 12 mm diameter to validate the force map inference performance and report details in Supplementary Fig. 5. The median position accuracy is 1 mm. For higher applied force, the underestimation of force magnitude is more pronounced. Force direction is measured to a high level of accuracy, achieving a median error of 8°. The median contact area estimates closely match expectations for a 12 mm indenter at each force level. As anticipated, the predicted force map is inhomogeneous and shows higher forces near the skeleton.

Multiple simultaneous contacts

We also qualitatively demonstrate the sensor’s performance during multiple complex contacts. Figure 4c shows the exemplary response to a human using two fingers to pinch and slightly twist the sensor. Each pixel of the force map contains the three force values estimated at that point. We facilitate interpretation by visualizing each contact force vector on the three-dimensional surface of the sensor. The experimenter’s counter-clockwise twisting input can be seen in the slant of the force vectors when the sensor is viewed axially. Extended Data Fig. 3 and Supplementary Video 4 show the response for other contact situations. In our experiments, the sensor was consistently able to discriminate up to five simultaneous contact points and estimate each contact area in a visually accurate manner.

Sensitivity

The final two experiments evaluated Insight’s sensitivity to subtle haptic stimuli. The sensor can accurately estimate its own orientation relative to gravity by visually observing the small gravity-induced deformations of the over-moulded elastomer (see Fig. 5a, Extended Data Fig. 4a and Supplementary Section A.3). Note that this experiment was conducted without any contacts and in a dark room to rule out other possible clues about self-posture. The median error for predicting yaw was 2.11°, and it was 4.45° for roll, with the highest errors for the roll angle around vertical, as expected. The camera was also found to capture relevant shape details when v-shaped wedges and extruded polygons were pressed into the tactile fovea (Fig. 5b).
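
A minimal sketch of this posture inference, assuming an off-the-shelf ResNet-18 backbone applied to the difference between the current and reference images (the exact layer configuration used for Insight differs, see Methods), is given below.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PostureNet(nn.Module):
    """Sketch: regress the sensor posture (yaw, roll) from the difference
    between the current image and the no-contact reference image."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, 2)
        self.backbone = backbone

    def forward(self, current_img, reference_img):
        return self.backbone(current_img - reference_img)  # (B, 2): [yaw, roll]
```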

Discussion

We present a soft haptic sensor named Insight that uses vision and learning to output a directional force map over its entire thumb-shaped surface. The sensor has a localization accuracy of 0.4 mm, force magnitude accuracy of 0.03 N and force direction accuracy of 5°. It can independently infer the locations, normal forces and shear forces of multiple simultaneous contacts—up to five regions in our evaluation. Moreover, the sensor is so sensitive that its quasi-static orientation relative to gravity can be inferred with an accuracy around 2°. A particularly sensitive tactile fovea with a thinner elastomer layer allows it to detect contact forces as low as 0.03 N and perceive the detailed shape of an object. A detailed comparison between Insight and other sensors can be found in Table 1 and the Methods.

The majority of sensors detect deformations with classical methods and use linear elastic theory to compute interaction forces. This approach requires good calibration and special care with reflection effects and inhomogeneous lighting. The linear relationship between deformations and forces is often violated for strong contacts and for inhomogeneous surfaces like the over-moulded shell of Insight. As our method is data-driven and uses end-to-end learning, all effects are modelled automatically. The downside of our approach is that it requires a precise test bed to collect reference data. Once constructed, the test bed can collect data for different sensor geometries—only a geometric model of the design is required.

The inhomogeneity of our sensor’s surface might cause unwanted effects in some applications. Robotic systems that move with high angular velocities and high accelerations will probably see tactile sensing artefacts caused by inertial deformations of the soft sensing surface; data collected during dynamic trajectories could potentially mitigate these effects.

In general, our sensor design concept can be applied and extended to a wide variety of robot body parts with different shapes and precision requirements. The machine-learning architecture, training process and inference process are all general and can be applied to differently shaped sensors or other sensor designs. We also provide ideas on how to adjust Insight’s design parameters for other applications, such as the field of view of the camera, the arrangement of the light sources and the composition of the elastomer.

Methods

We conducted several experiments to make informed design choices and validate the functionality of Insight.

Sensor shape and camera view

The sensor is cone-shaped with a rounded tip to allow an all-round touch sensation in a structure similar to a human thumb. The sensor has a base diameter of 40 mm and a height of 70 mm. The Raspberry Pi camera v.2.0 (MakerHawk Raspberry Pi Camera Module 8 MP) has a resolution of 1,640 × 1,232 and a frame rate of 40 fps. With a 160° fisheye lens, the camera’s field of view is 123.8° × 91.0°. See Supplementary Fig. 1 and Supplementary Table 1 for more details on the camera. Multiple cameras are recommended if the whole sensing surface cannot be seen by a single camera, as done in OmniTact27, at the cost of increased wiring, material costs and computational load.

Light source and collimator

We use a commercial LED ring that contains eight tri-colour LEDs (WS2812 5050). The LED colours are programmed to be red (R), green (G), blue (B), R, G, R, B, G in circumferential order, and the relative brightnesses for the R, G and B light sources are 1:1:0.5, respectively. We designed a three-dimensionally printable collimator (three-dimensional printer, Formlabs Form 3; material, standard black; Formlabs owns the trademark and copyright of these names and pictures used in Fig. 2bii) with a tuned diameter (2.5 mm) and a radially tilted angle (3°) toward the outside to constrain the light-emitting path and create the structured light distribution (see Extended Data Fig. 1). Detailed analysis can be found in Supplementary Section A.1.
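
For illustration, the LED pattern described above could be programmed as in the following sketch, which assumes the CircuitPython neopixel driver and a particular Raspberry Pi data pin; the driver and pin choice are assumptions, and any WS2812-compatible driver would serve.

```python
# Sketch of the LED-ring pattern; driver library and data pin are assumptions.
import board
import neopixel

NUM_LEDS = 8
# Circumferential order R, G, B, R, G, R, B, G with relative brightness 1:1:0.5,
# so the blue LEDs are driven at roughly half intensity.
RED, GREEN, BLUE = (255, 0, 0), (0, 255, 0), (0, 0, 128)
PATTERN = [RED, GREEN, BLUE, RED, GREEN, RED, BLUE, GREEN]

pixels = neopixel.NeoPixel(board.D18, NUM_LEDS, auto_write=False)
for i, colour in enumerate(PATTERN):
    pixels[i] = colour
pixels.show()
```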

Soft surface material, skeleton and over-moulding

Insight’s mechanical properties are optimized to ensure high sensitivity to contact forces, robustness against impact forces and low fatigue effects. We choose Smooth-On EcoFlex 00-30 silicone rubber as the mouldable material for the soft sensing surface because it is readily available and has a high elongation ratio of 900% (Supplementary Table 3). The skeleton is made of AlSi10Mg-0403 aluminium alloy, which can withstand forces up to 40 N in the shape of our prototype structure (Supplementary Table 4). These two materials are chosen based on their material data sheets and finite element analysis results50. The elastomer is cast using three-dimensionally printed moulds (three-dimensional printer, Formlabs Form 3; material, tough; Formlabs owns the trademark and copyright of these names and pictures used in Fig. 2bii), and the skeleton is three-dimensionally printed in aluminium (three-dimensional printer, ExOne X1 25 Pro; material, AlSi10Mg-0403; ExOne GmbH owns the trademark and copyright of the name and picture used in Fig. 2bi). We combine the skeleton and the elastomer without adhesive by over-moulding, as described in Figs. 1b and 2b. Due to the working principle, the moulds require no special treatment; they are used straight from the three-dimensional printer, in contrast to, for example, Romero and colleagues29. Furthermore, the manufacturing procedure is simple and requires only a single step.

The diameter of the skeleton beams and the thickness of the surrounding elastomer are optimized for robustness, as described in Extended Data Fig. 5. Finite element analysis revealed that the system’s sensitivity to contact forces improved by positioning the skeleton not in the centre of the elastomer layer but closer to the inner surface.

Optical properties

We need a material with the right reflective properties (albedo, specularity) for the sensing surface. It should not be too reflective as reflections saturate the camera and diminish sensitivity. Simultaneously, no point on the surface should be very dark, as the camera needs to detect changes in reflected light. Moreover, the material has to prevent ambient light from perturbing the image. The sensing surface is made of a flexible and mouldable translucent elastomer mixed with aluminium powder and aluminium flakes. The aluminium powder makes the surface opaque to ambient light, and the aluminium flakes adjust the reflective properties, as shown in Figs. 1c and 2b, and Supplementary Fig. 2 and Supplementary Table 2. Aluminium powder with 65 μm particle diameter and aluminium flakes with 75 μm particle diameter are mixed with the EcoFlex 00-30 in a weight ratio of 20:3:400 to ensure proper light reflection properties and opacity (see Supplementary Fig. 2 and Supplementary Table 2).

Finite element analysis

To analyse the over-moulded shell’s mechanical properties during the design process, we built a finite element model using Ansys50. The model includes Young’s modulus and Poisson’s ratio for both EcoFlex 00-30 (70 kPa, 0.4999) and the aluminium alloy (73 GPa, 0.33)51.

Test bed

We created a custom test bed with five DOF; three DOF control the Cartesian movement of the probe (x, y, z) using linear guide rails (Barch Motion) with a precision of 0.05 mm, and two DOF set the orientation of the sensor (yaw, roll) using Dynamixel MX-64AT and MX-28AT servo motors with a rotational precision of 0.09°, which results in a translational precision of 0.2 mm at the tip of the sensor. The probe is fabricated from an aluminium alloy and is rigidly attached to the Cartesian gantry via an ATI Mini40 force/torque sensor with a force precision of 0.01/0.01/0.02 N (Fx/Fy/Fz). Insight is held at the desired orientation, and the indenter is used to contact it at the desired location.

Data

Measurements are collected using our automated test bed to probe Insight in different locations. To obtain a variety of normal and shear forces, the indenter moves to a specified location, touches the outer surface and deforms it increasingly by moving normal to the surface with fixed steps of 0.2 mm. For each indentation level, the indenter also moves sideways to apply shear forces (normal/shear movement ratio 2:1). After a pause of 2 s to allow transients to dissipate, we simultaneously record the contact location, the indenter contact force vector from the test bed’s force sensor, and the camera image from inside Insight. When the measured total force exceeds 1.6 N, the data collection procedure at this specified location terminates and restarts at another location. The contact location and measured force vector are combined to create the true force distribution map using the method described in Supplementary Section A.3. Images from Insight are captured using a Raspberry Pi 4 Model B and are collected and combined using a standard laptop.
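
The per-location probing routine can be summarized by the sketch below; the test-bed, camera and force-sensor driver objects are hypothetical placeholders, whereas the numeric parameters follow the values given above.

```python
import time

NORMAL_STEP_MM = 0.2   # indentation increment normal to the surface
SHEAR_RATIO = 0.5      # sideways movement per unit of normal movement (2:1)
FORCE_LIMIT_N = 1.6    # stop probing a location once the total force exceeds this
SETTLE_TIME_S = 2.0    # pause before recording to let transients dissipate

def probe_location(testbed, camera, force_sensor, location):
    """Per-location indentation routine (sketch with hypothetical drivers)."""
    samples = []
    depth = 0.0
    testbed.move_to_surface(location)
    while True:
        depth += NORMAL_STEP_MM
        testbed.indent(location, normal_mm=depth, shear_mm=SHEAR_RATIO * depth)
        time.sleep(SETTLE_TIME_S)                 # allow the elastomer to settle
        force = force_sensor.read()               # (Fx, Fy, Fz) in newtons
        samples.append((location, force, camera.capture()))
        if sum(f * f for f in force) ** 0.5 > FORCE_LIMIT_N:
            break                                 # restart at another location
    testbed.retract()
    return samples
```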

Challenges with analytical data processing

Classical force estimation pipelines compute the contact forces from the surface displacement using elastic theory37. The displacement map can be acquired by analytically reconstructing the normals of the sensing surface or numerically deriving the relative movement of labelled markers from the raw image captured by the camera, as done in refs. 22,33. However, large deformations violate linearity between displacement and force. In addition, the over-moulding in our design creates an inhomogeneous surface, where the stiffness matrix is difficult to model accurately. Shear forces are visible as small lateral deformations that highly depend on the distance to the stiff skeleton. Moreover, the reconstruction of surface normals requires evenly distributed light, without shadows or internal reflections21. Tracking markers23,33,52 rather than a surface does not solve the fundamental problems with displacement-focused approaches.
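
For contrast, the classical chain reduces to a single linear map from nodal displacements to nodal forces, as in the sketch below; the stiffness matrix would have to be calibrated, which is exactly where the assumptions discussed above break down for Insight.

```python
import numpy as np

def forces_from_displacements(stiffness_matrix, displacements):
    """Classical linear-elastic estimation, shown for contrast (not used by Insight).

    stiffness_matrix: (3N, 3N) calibrated stiffness matrix K.
    displacements: (N, 3) nodal displacement vectors u.
    Returns (N, 3) nodal forces from f = K u; the linearity assumption breaks
    down for large deformations and inhomogeneous surfaces.
    """
    u = displacements.reshape(-1)   # stack the nodal displacements into one vector
    f = stiffness_matrix @ u        # linear map f = K u
    return f.reshape(-1, 3)
```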

Machine learning

A ResNet48 structure is used as our machine-learning model. The data for single contact includes a total of 187,358 samples at 3,800 randomly selected initial contact locations. The dataset is split into training, validation and test subsets with a ratio of 3:1:1 according to the locations. The data for posture estimation from gravity contains 16,000 measurements and is split in the same way. We use four blocks of ResNet to estimate the contact position and amplitude directly (Fig. 3), two blocks to estimate the force distribution map (Fig. 4) and four blocks to estimate the sensor posture (Fig. 5). The machine-learning models are all trained with a batch size of 64 for 32 epochs, using Adam with a learning rate of 0.001 to minimize a mean-squared-error loss. Supplementary Section A.4 provides more details about the structure of the machine-learning models that we use. The performance of the models with less training data is studied in Supplementary Section A.6 (Supplementary Fig. 7).
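
The training configuration can be summarized by the following PyTorch sketch, which assumes a data loader yielding batches of (contact image, reference image, skeleton image, target) with batch size 64 and a model that accepts the three images, as in the earlier sketch; the hyperparameters match those stated above, and the loader construction is omitted.

```python
import torch

def train(model, train_loader, epochs=32, lr=1e-3):
    """Training loop with the reported hyperparameters: Adam, learning rate 0.001,
    mean-squared-error loss; the batch size of 64 is set in the data loader."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for contact_img, reference_img, skeleton_img, target in train_loader:
            optimizer.zero_grad()
            prediction = model(contact_img, reference_img, skeleton_img)
            loss = loss_fn(prediction, target)
            loss.backward()
            optimizer.step()
```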

Operating speed

The current version of Insight is not optimized for processing speed. Images are captured using a Raspberry Pi 4 with a Python script and transmitted to a host computer (with a GeForce RTX 2080Ti GPU) via Gigabit Ethernet. Images have a size of 1,640 × 1,232, are transferred at an effective rate of 11 fps, and are downsized to 410 × 308 using a Python script. The image processing with the deep network for the force-map prediction runs at 10 fps in real time. We see multiple ways to increase the operating speed, ranging from optimized code to hardware improvements (for example, using an Intel Neural Compute Stick) to choosing a faster deep network.
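
On the host side, the downscaling step amounts to a single OpenCV call, as sketched below; the frame transport from the Raspberry Pi is not shown.

```python
import cv2

FULL_SIZE = (1640, 1232)  # capture resolution on the Raspberry Pi (width, height)
NET_SIZE = (410, 308)     # network input resolution (width, height)

def prepare_frame(frame_bgr):
    """Downscale a received camera frame to the network input size."""
    assert (frame_bgr.shape[1], frame_bgr.shape[0]) == FULL_SIZE
    return cv2.resize(frame_bgr, NET_SIZE, interpolation=cv2.INTER_AREA)
```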

Comparison to state-of-the-art sensors

How does Insight compare with other vision-based haptic sensors? Table 1 lists its performance along with that of thirteen selected state-of-the-art sensors; we first give an overview and then compare the designs. One of the earliest vision-based sensors is GelSight21, which has a thin reflective coating on top of a transparent elastomer layer supported by a flat acrylic plate. Lighting parallel to the surface allows tiny deformations to be detected using photometric stereo techniques. Further developments of this approach increased its robustness (GelSlim22), achieved curved sensing surfaces with one camera (GelTip28) and with five cameras (OmniTact27), and included markers to obtain shear force information22. A different technique based on tracking of small beads inside a transparent elastomer is used by GelForce33 and the Sferrazza and D’Andrea sensor23,52 to estimate normal and shear force maps. ChromaForce (not listed in Table 1) uses subtractive colour mixing to extract similar data from deformable optical markers in a transparent layer53. The TacTip25 sensor family uses a hollow structure with a soft shell, and it detects deformations on the inside of that shell by visually tracking markers. Muscularis54 and TacLink55 extend this method to larger surfaces, such as robotic links, by using a pressurized chamber to maintain the shape of the outer shell; they are not listed in the table because they target a different application domain.

In terms of shape recognition and level of detail, the GelSight approach provides unparalleled performance. The tracking-based methods, such as GelForce and TacTip, are naturally limited by their marker density and thicker outer layer. Insight uses shading effects to achieve a much higher information density than is possible with markers, but its accuracy is also somewhat limited by measuring at the inside of a soft shell with non-negligible thickness. Beyond accurately sensing contacts, the robustness of haptic sensors is of prime importance. Without additional protection, GelSight-based sensors are comparably fragile due to their thin reflective outer coating, which can easily be damaged. Adding another layer increases robustness, but imaging artefacts were reported to appear after about 1,500 contact trials due to wear effects42. We tested Insight for more than 400,000 interactions without noticeable damage or change in performance.

Each sensing technology imposes different restrictions on the surface geometry of the sensor. Vision-based tactile sensors need the measurement surface to be visible from the inside, so there is typically no space available for other items inside the sensor. The type of visual processing also matters. TacTip’s need to track individual markers requires a more perpendicular view of the surface than shading-based approaches (GelSight and Insight). Soft materials deform well during gentle and moderate contact, but they do not withstand high forces if not adequately supported. GelSight uses a transparent rigid structure for support, which can lead to reflection artefacts when adapted to a curved sensing surface28. An alternative is high internal pressure55, but then the observed deformations are non-local. The over-moulded stiff skeleton in Insight maintains locality of deformations and withstands high forces.

To facilitate widespread adoption, tactile sensors need to be easy to produce from inexpensive components. Imaging components are remarkably cheap these days, making vision-based sensors competitive; however, GelSight needs a reproducible surface coating and permanent bonding between all layers, which are tricky to implement correctly21,56,57. TacTip needs well-placed markers or a multi-material surface that can be three-dimensionally printed only by specialized machines. Insight uses one homogeneous elastomer that requires only a single-step moulding procedure on top of the stiff three-dimensionally printed skeleton. Being able to replace the sensing surface in a modular way increases system longevity; such replacement is supported by GelSight and TacTip in principle, and it is designed to be easy in Insight, although we did not evaluate the quality of the results that can be obtained without retraining.