Monday, April 27, 2026

State of the Art: Sony AI

I had drafted an update post about how the state of the art has changed over 13 years. I'll get that out at some point. But something marvelous was released just a week or so ago.

Sony AI published their work in Nature. It was a real hit to my motivation, since it is so good, and it landed just when I had decided to come back to this project. They implemented 90% of the things that I was thinking were necessary to make a great robot.

So I want to celebrate and congratulate their work here. I'll try to explain what they did, highlighting the stuff I consider most interesting, but the best sources are their paper and the website they have devoted to it.

    Website

P. Dürr et al., “Outplaying elite table tennis players with an autonomous robot,” Nature, vol. 652, no. 8111, pp. 886–891, Apr. 2026, doi: 10.1038/s41586-026-10338-5.

Method
They built their own hardware platform with 8 degrees of freedom (dof). They use an x-y motion platform on the floor under the arm. That's big and heavy (looks like a CNC mill x-y stage), but it's also fast. Then they put a 6-dof SCARA-inspired arm on top of that. I'd call it a 3 dof shoulder, 1 dof elbow, and 2 dof wrist.

Arm diagram, from the Nature paper

They designed the arm themselves, presumably customized for speed and low load. They talk about their finite element analysis (FEA), with topology optimization to arrive at the link geometry that maximized stiffness in the required directions while minimizing weight. They even printed these in "Scalmalloy", a scandium-aluminum-magnesium alloy. That can't be cheap -- $500/kg for the raw powder.

They use 9 cameras at 200 Hz, placed all around the table, up high on truss towers. Combined, they claim 3 mm position error and 10.2 ms latency. The sensors are capable of 1440x1080 at up to 276 Hz, with a global shutter. The paper doesn't say what pixel resolution they read off the camera, but 3 mm is about what one camera could do in 2D at the full 1080p resolution.
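As a sanity check on that 3 mm claim, here's a back-of-the-envelope pixel-footprint calculation. The camera distance and field of view are my guesses, not numbers from the paper:

```python
import math

def pixel_footprint_mm(distance_m, hfov_deg, h_pixels):
    """Approximate size of one pixel projected onto the scene, in mm."""
    # Width of the visible scene at the given distance.
    width_m = 2.0 * distance_m * math.tan(math.radians(hfov_deg) / 2.0)
    return width_m / h_pixels * 1000.0

# Assumed geometry: camera ~4 m from the table, ~40 degree lens,
# reading the sensor's full 1440-pixel width.
footprint = pixel_footprint_mm(4.0, 40.0, 1440)  # ~2 mm per pixel
```

At roughly 2 mm per pixel under these assumptions, sub-pixel centroiding fused across nine cameras makes the 3 mm figure entirely plausible.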
Then they use 3 more cameras for spin. Rather than traditional cameras that take a complete image at a steady pace, these are event cameras, which report only what changed in the image, and can therefore provide updates much more frequently. In front of the cameras is a pair of galvanometers -- little mirrors on servos, traditionally used to scan lasers very quickly. Here they are used to keep the camera pointed at the ball as it moves.

They created a CNN to turn this into the rotation of the ball at 400 to 700 Hz. Does this sound familiar to you? It should. I blogged about this exact method in 2013 when the Ishikawa lab at the University of Tokyo invented it. Back then they demonstrated that one could see spin by drawing lines on a ball. But Sony has improved it by watching the text label already printed on the ball spin past, and they actually calculate the direction and speed of the spin! I'm really jealous of this, as it's something I've been interested in since the beginning.
Spin camera system, from the Sony website (quality is probably better there too)
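To get a feel for the numbers, here's a toy one-axis version of spin estimation -- nothing like Sony's CNN, which recovers the full 3D axis -- where the spin rate falls straight out of the label's in-plane angle in two consecutive frames, as long as the ball turns less than half a revolution between them:

```python
import math

def spin_rate_hz(angle_a, angle_b, frame_dt):
    """Spin rate in rev/s from the label's angle (radians) in two frames.

    Assumes the ball rotates less than half a revolution between frames;
    faster spins would alias, just like a wagon wheel on film.
    """
    # Shortest signed angular difference, wrapped into (-pi, pi].
    delta = (angle_b - angle_a + math.pi) % (2.0 * math.pi) - math.pi
    return abs(delta) / (2.0 * math.pi) / frame_dt

# At a 500 Hz update rate, a 0.6 rad rotation per frame is about 48 rev/s.
rate = spin_rate_hz(0.0, 0.6, 1.0 / 500.0)
```

At the 400 to 700 Hz rates they quote, the alias-free limit of this simple scheme would be roughly 200 to 350 rev/s, comfortably above even heavy competitive topspin.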


With this vision system they internally calculate state at 30 Hz. They train RL policies using Soft Actor-Critic (SAC). I say "policies" -- plural -- because they make different policies for a set of "skills", each of which is designed to place the ball on the opponent's side with a certain spin (and speed and location?). The policy looks ahead 1/30th of a second to a desired position, which is turned into a desired and feasible joint position and velocity. They then break this down into position waypoints at 1000 Hz. They safety-check the next actions with traditional model-based control (not RL) and use the model-based alternative as a fallback. After striking the ball, they use model-based control to return to a neutral ready position.

Something notably missing in the Sony approach is human demonstrations. All of the learning was done in simulation, which means they had to get their simulation very, very accurate. They had to keep improving their models of the ball's fluid dynamics and how it bounces in order to predict the future well enough.
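To illustrate why the ball's aerodynamics matter so much for their simulator, here's a minimal ball-flight sketch. This is my own toy model, not Sony's: gravity, quadratic drag, and a simplified Magnus force along the spin-cross-velocity direction, with assumed constant coefficients (a real model makes the lift coefficient depend on the spin ratio):

```python
import math

def flight_range(v0, spin, h0=0.3, dt=1e-3):
    """Horizontal distance a ball travels before reaching table height.

    v0: initial velocity (m/s), spin: rotation vector (rad/s), h0: launch
    height above the table (m). Toy model with assumed coefficients.
    """
    m, r = 0.0027, 0.02              # standard 2.7 g, 40 mm ball
    A = math.pi * r * r              # cross-sectional area
    rho, g = 1.2, 9.81
    Cd, Cl = 0.5, 0.25               # assumed constant drag/lift coefficients
    pos, vel = [0.0, 0.0, h0], list(v0)
    while pos[2] > 0.0:
        speed = math.sqrt(sum(c * c for c in vel))
        # Drag opposes the velocity direction.
        Fd = [-0.5 * rho * A * Cd * speed * c for c in vel]
        # Magnus force acts along spin x velocity (downward for topspin).
        wx, wy, wz = spin
        cx = wy * vel[2] - wz * vel[1]
        cy = wz * vel[0] - wx * vel[2]
        cz = wx * vel[1] - wy * vel[0]
        cnorm = math.sqrt(cx * cx + cy * cy + cz * cz)
        if cnorm > 1e-9:
            Fm_mag = 0.5 * rho * A * Cl * speed * speed
            Fm = [Fm_mag * c / cnorm for c in (cx, cy, cz)]
        else:
            Fm = [0.0, 0.0, 0.0]
        # Explicit Euler step with gravity.
        for i, Fg in enumerate((0.0, 0.0, -m * g)):
            vel[i] += (Fd[i] + Fm[i] + Fg) / m * dt
            pos[i] += vel[i] * dt
    return pos[0]

# A 10 m/s shot with heavy topspin dips and lands shorter than a flat one.
r_topspin = flight_range((10.0, 0.0, 0.0), (0.0, 300.0, 0.0))
r_flat = flight_range((10.0, 0.0, 0.0), (0.0, 0.0, 0.0))
```

Even this crude sketch shows how sensitive the landing point is to spin; small errors in the drag or lift model compound over a whole trajectory, which is presumably why they kept refining those models.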


Pros
  • It can serve! Unlike previous robots, they implemented the ability to serve properly. It has a little cup for the ball, from which it tosses the ball into the air before striking it with the racket.
  • They play with official rules. Like all of them. Like they had an accredited referee applying the rules. They toss the serve properly. They play everywhere on the table. They use a real racket. The only modification they made was letting it use the official rules for one-armed players, which means they are allowed to use the same arm to toss the serve as to strike it.
  • They made their own really cool hardware!
  • It beats "expert" players. They found really good players (defined as "competitive athletes" with 10 years of "intensive" training) and then they beat them in real matches. They also played against professional players, but lost to them.
Cons
  • Not a lot. I'm really trying to find things to nitpick.
  • They aren't ego-centric, meaning they use a ton of cameras everywhere around the table, not just in the robot. This is very un-human and seems "unfair".
  • The robot form isn't exactly human... but it isn't that far off either. It has an arm, which is similarly articulated. Not like the Forpheus robot, which is very far from human form (but also cool).
  • They are far from portable. The robot looks like it needs a forklift. And then it needs its 12 cameras to be set up. It's safe to say they won't be bringing it to your garage to play a game.
  • Their strategy is simple, and they admit this. They choose shots one at a time, rather than chaining them together in anything intentional, either across the rally or across the game.
  • It loses to professional players. Which just means we aren't quite at the "AlphaGo moment", where robots are superhuman. But this doesn't seem far behind.

So where does this leave me? It was seriously demotivating. But I've continued my work anyway. Maybe I just want to be able to do it myself, even if I can't beat Sony. Maybe there are still some areas where I can find unique improvements. Things like portability, ego-centric vision, cost. Maybe.
