Thursday, January 31, 2013

Where's the table?

Background subtraction has gone fairly well. It seems that it will be a decent way to find the ball. Making a good background model will reduce noise in the background. Stereo vision will make it easier to differentiate the ball from the opponent. Looking at consecutive frames will narrow down the possible locations for the ball. So I'm moving on to the next question: where am I?

I realized that even if I could identify the ball, I needed to know where it was in relation to the table and in relation to the robot. I'm going to assume that the table is stationary, that the robot's base is stationary, and that the cameras are stationary. All plus or minus some noise when the wind blows and the camera shakes.

So that means that this problem only needs to be solved once when setting up, and that solution should remain valid. I suppose I could force my setup to be rigidly measured. E.g. the robot base is 20cm behind the table at the center line, 30cm above the table, always. Yeah, that could work. In fact, I will probably do it that way first. But it would be nice if my robot could be placed in front of the table and figure that out for himself. After all, ping pong tables are fairly distinctive objects. You'd think it would be easy to identify the table in the image, apply some stereo math, and come up with the 3D model of where the cameras and robot are with respect to the table.
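
Looking ahead a little, here's the flavour of the math I have in mind, sketched with OpenCV's Python bindings (nothing I've actually run yet). If I could find the four corners of the table in one camera's image, then since I know the real-world dimensions of a ping pong table (2.74m x 1.525m), solvePnP can recover where that camera sits relative to the table. The pixel coordinates, camera intrinsics, and distortion numbers below are all placeholders.

import numpy as np
import cv2

# Table corners in table coordinates (metres), origin at one corner.
# z = 0 everywhere because the playing surface is flat.
table_corners_3d = np.array([
    [0.0,  0.0,   0.0],
    [2.74, 0.0,   0.0],
    [2.74, 1.525, 0.0],
    [0.0,  1.525, 0.0],
], dtype=np.float64)

# The same corners as found in the image, in pixels.
# These numbers are made up -- detecting them is the actual problem.
table_corners_2d = np.array([
    [310.0,  420.0],
    [980.0,  430.0],
    [1150.0, 650.0],
    [140.0,  660.0],
], dtype=np.float64)

# Camera intrinsics from a prior calibration (also placeholders).
K = np.array([[900.0,   0.0, 640.0],
              [  0.0, 900.0, 360.0],
              [  0.0,   0.0,   1.0]])
dist = np.zeros(5)  # pretend lens distortion is negligible for the sketch

ok, rvec, tvec = cv2.solvePnP(table_corners_3d, table_corners_2d, K, dist)
R, _ = cv2.Rodrigues(rvec)
camera_pos = -R.T.dot(tvec)   # camera position expressed in table coordinates
print("camera sits at (m):", camera_pos.ravel())

Stereo should tighten this up, but even one camera plus the known table size pins the geometry down.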

New Image

Here is the input image I've decided to work with. I've blurred my handsome face because I'm a privacy freak and because I'm embarrassed that my forehead looks like a ping pong ball. It has some challenging conditions: patches of bright sunshine in the background that get reflected in the surface of the table. It's kind of a pain in the ass. It makes me think that maybe robot researchers resort to carefully controlled lighting and matte surfaces on everything. But I'm better than that. Maybe.


As a human, it sure seems easy to identify the table. Not a lot of ambiguity, despite the challenging conditions. So shouldn't a machine be able to do this easily?

Detect The Color Blue

If I'm going to be using color images (still to be decided), I should make use of the color information. My table is blue, which makes it stand out. Annoyingly my shirt is a similar blue, and my jeans are not far off. How do you even identify "blue" from an image?

I went to Matlab to experiment. Since blue is one of the RGB colors, it should be easy. However, RGB space has no separate "light/dark" component: "light" happens when you have more of every color, "dark" when you have less of every color, so a lighter or darker blue drags the R and G channels along with it. Here's what I came up with. In short, the B channel has to dominate the other two.

function blue = blueDetect(img)
% split the colors for clarity
r = double(img(:,:,1));
g = double(img(:,:,2));
b = double(img(:,:,3));
% calculate blueness: how much the B channel dominates the other two
% (the + eps avoids dividing by zero on pure-black pixels)
blue = (b - r*0.5 - g*0.5) ./ (b + eps);
% display the image, normalizing so that the most blue pixel is all the way red
image( (blue ./ max(max(blue))) * 64);

And here's what it looks like:
This uses Matlab's default "jet" color scheme: the more "hot" the color, the more "blue" I've decided it is. (I'll use this color scheme in other images but I'll be excluding the scale from now on.) So it was kind of accurate. You can apply a threshold to give a yes/no answer as to whether each pixel was blue. This is using 30% "blueness" as the cutoff:
I think this shows the real problem: some of the table isn't blue. Go back and look at the original image. On the far side of the table, the reflection makes it more white/gray than blue. My blue detector doesn't even think it's close. That's going to be a problem. A human uses other clues to decide that it's a reflection, but apparently that's not easy for a machine. It also gets confused by the back of the paddle on the right of the image. The paddle is black. To a human, it's pretty clearly black. But the way I calculate things, it has more blue than red or green, so it gets bucketed as blue.

I decided to make another blue detector. This time I decided to abandon the RGB color space in favor of HSV. That stands for hue, saturation, value. "Value" seems to be approximately equivalent to "light" vs "dark". I'm going to give you three images, one for each of H, S, and V, so you can see how they behave.


So the good news is that the table is a very consistent hue value. The bad news is that the background wall is almost the same hue. As is the cement below the table. But when you look at the saturation, you can see that the table is a very saturated blue, but the wall and the cement are not. So it seems that it might be possible to combine this information to identify the blue of the table.

I chose to do this as a probability distribution. I create a normal distribution for each of H, S, V that says what the mean value of "table blue" is, and how far it is allowed to stray from that ideal. When evaluating a pixel, I multiply the probability from each of the distributions together to get a combined probability of being "table blue". Here's the code.

function blue = blueDetect2(img)
% convert from rgb to hsv color space
imghsv = rgb2hsv(img);

% define the means and standard deviations in each of H S V
% hue is very tight around 39/64ths
hueMean = 39.0 / 64.0;
hueStd = 1.8 / 64.0;
% saturation is high but wide to account for glare and shadow
satMean = 42.0 / 64.0;
satStd = 11.0 / 64.0;
% value is high and wide (but this captures most of the image)
valMean = 47.0 / 64.0;
valStd = 8.0 / 64.0;

% calculate the probability for each pixel for each component
hueFit = normpdf( (imghsv(:,:,1) - hueMean) ./ hueStd );
satFit = normpdf( (imghsv(:,:,2) - satMean) ./ satStd );
valFit = normpdf( (imghsv(:,:,3) - valMean) ./ valStd );
% combine the three component probabilities together
allFit = hueFit .* satFit .* valFit;

blue = allFit;
% display the image, normalizing so that the most blue pixel is all the way red
image( (blue ./ max(max(blue))) * 64);

And here's the resulting image:
That seems to be better than the previous method. It is more sure about the table, and doesn't seem distracted by the paddle. There is less "blueness" on the wall or the rocks in the bottom right. Here is a thresholded image at 30% blueness:

Conclusion

This entry explored ways to identify the table via color. It was moderately successful, but the challenging image meant that it wasn't perfect.

I still haven't identified the corners of the table, which would be necessary for my 3D model. And there are many other approaches I could try, aside from color, to identify the table. I leave that for other blog entries.

Tuesday, January 29, 2013

OpenCV

After my first attempt at doing some vision processing in Matlab, I concluded there must be a better way. I don't want to rule out Matlab entirely. It seems that Matlab can be taken seriously for vision applications. But that would require buying the Toolbox that does that sort of stuff, and I'm not willing to put down the money on software that I might not need. I worry that anything I need to implement myself (as opposed to being built-in to the Toolbox) will be inefficient. Matlab is really only efficient for matrix operations.

OpenCV

It didn't take long to find OpenCV, which is an open source machine vision library. It implements many algorithms common in vision applications, and provides some of the framework to make C++ closer to the simplicity of Matlab. Since it does much of what the Matlab Toolbox does, and will allow me to write efficient custom implementations in C++, I think it provides more room to grow.

OpenCV was a bit of a pain to build from source. Some dependencies also had to be built from source, as the versions offered on my RHEL 6 machine were too old. It's been a week or two since I did the install, so I'm afraid I can't recount any of the details. In the end I got it installed and it seems to be working.

OpenCV also comes with Python bindings, which makes it more fun to do quick exploration work. I've been going back and forth between the two, depending on what I'm doing.

Background Subtraction

So, what can I do with OpenCV? I decided to stick with background subtraction for now. OpenCV has a more complicated approach to background subtraction, and that's probably a good thing since my quick Matlab approach had flaws.

There is more than one background subtraction algorithm in OpenCV, but I've chosen the one that seems the most popular and/or newest: BackgroundSubtractorMOG2. My vague understanding is that this method builds up a history of each pixel's color and fits a statistical distribution to that history. Then, when you ask it to decide whether a new value for that pixel is foreground or background, it compares the new value to the distribution. If it is too different from the historical distribution, it is flagged as foreground. The MOG part refers to a "mixture of Gaussians", meaning each pixel's history is modeled not by one normal distribution but by several. That's intended to capture different valid background states of the pixel. For example, a pixel might sometimes have a leaf in it, and sometimes the leaf might have shifted out of the way, revealing the wall behind. Both of those should count as background, even though they are vastly different colors. Obviously I will need to do more reading if I want to understand it. It's available in this paper: Improved Adaptive Gaussian Mixture Model for Background Subtraction.
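
Here's my rough mental model of the per-pixel test, sketched in Python with numpy. To be clear, this is not OpenCV's actual implementation, and all the numbers are invented -- it's just the idea of keeping a few weighted Gaussians per pixel and asking whether a new value is close to any of the believable ones.

import numpy as np

# Toy mixture for ONE pixel: each background "mode" has a weight, mean and variance.
# The real algorithm keeps something like this per pixel and per color channel,
# and updates the weights/means/variances online as frames arrive.
modes = [
    {"weight": 0.6, "mean": 140.0, "var": 15.0},   # e.g. the wall behind a leaf
    {"weight": 0.3, "mean": 60.0,  "var": 20.0},   # e.g. the leaf itself
    {"weight": 0.1, "mean": 200.0, "var": 50.0},   # a rarely-seen state
]

def is_background(value, modes, weight_threshold=0.2, sigmas=2.5):
    # Background if the value lies within `sigmas` standard deviations
    # of a mode that carries enough weight to be believable.
    for m in modes:
        if m["weight"] < weight_threshold:
            continue
        if abs(value - m["mean"]) <= sigmas * np.sqrt(m["var"]):
            return True
    return False

print(is_background(145.0, modes))   # True: close to the dominant mode
print(is_background(250.0, modes))   # False: fits no believable mode, so it's foreground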

MOG2 also has a built-in shadow identification. Again, I don't know the details, but it flags shadows as not being foreground, and it decides something is a shadow if it is an appropriately dimmer version of the same color.

This algorithm requires many more frames of input in order to decide what the background looks like, so I feed it every frame of the movie to build up its model. The following code does all this, and displays/saves the foreground image every 100 frames.

#include <iostream>
#include <cstdio>
#include <opencv2/opencv.hpp>

using namespace std;

void DisplayProgress(cv::Mat& img, cv::Mat& background, cv::Mat& foregroundMask, int frameindex)
{
 cv::imshow("Original", img);
 cv::imshow("Background", background);
 cv::imshow("Foreground Mask", foregroundMask);
 // process the foreground further to remove noise and shadows, etc
 // shadows are masked with value 127
 cv::Mat noShadowForeMask = foregroundMask & (foregroundMask != 127);
 cv::Mat smoothForeMask;
 cv::GaussianBlur(noShadowForeMask, smoothForeMask, cv::Size(11,11), 4.0, 4.0);
 cv::imshow("Foreground Blurred", smoothForeMask);
 cv::Mat binarySmoothForeMask = (smoothForeMask > 64);
 cv::imshow("Foreground Blurred Binary", binarySmoothForeMask);
 // extract the foreground picture
 cv::Mat forePic;
 img.copyTo(forePic, binarySmoothForeMask);
 cv::imshow("Fore Picture", forePic);
 // save the foreground
 const char* filenameFormat = "/home/me/src/ping/out%03d.png";
 char namebuff[256];
 sprintf(namebuff, filenameFormat, frameindex);
 cv::imwrite(namebuff, forePic);
 // wait for user to hit a key before continuing
 cv::waitKey(-1);
 cv::destroyAllWindows();
}

bool LoadImage(const char* filenameFormat, int frameIndex, cv::Mat& fillMeWithImage)
{
 char filename[256];
 sprintf(filename, filenameFormat, frameIndex);
 fillMeWithImage = cv::imread(filename);
 return (fillMeWithImage.data != NULL);
}

int main(int argc, char** argv)
{
 cv::Mat img;
 cv::Mat foreground;
 cv::Mat background;

 cv::BackgroundSubtractorMOG2 bgSub(200, 10.0, true);

 const char* filenameFormat = "/home/me/src/ping/movie1png/movie1-%03d.png";
 for (int frameindex = 1; /*infinite*/; ++frameindex)
 {
  if (!LoadImage(filenameFormat, frameindex, img))
  {
   cout << "Can't find frame " << frameindex << " so assuming we reached end of movie" << endl;
   // display last progress at last image in the movie
   // re-read the last image that existed
   --frameindex;
   LoadImage(filenameFormat, frameindex, img);
   bgSub.getBackgroundImage(background);
   DisplayProgress(img, background, foreground, frameindex);
   break;
  }
  // learn the new image
  bgSub(img, foreground);
  cout << "Added frame " << frameindex << " to background subtraction processor" << endl;
  // display progress occasionally (every 1.67 seconds at 60 fps)
  if (frameindex % 100 == 0)
  {
   bgSub.getBackgroundImage(background);
   DisplayProgress(img, background, foreground, frameindex);
  }
 }

 cout << "Done" << endl;
 cv::destroyAllWindows();
 return 0;
}
I'm doing a little extra processing on the foreground decision it makes. I ignore shadows as not being foreground (thank you built-in functionality). I then blur the true/false mask, and then make it true/false again. Effectively I'm looking for pixels whose neighbors are foreground, but that weren't foreground themselves. This nicely prevents lone pixels in the middle of a foreground blob from being excluded unfairly. It also expands the region marked as foreground, which might be a bad thing.
For ease of comparison with the Matlab results, I've run it on the same movie I used in that blog post, and manually grabbed the foreground pictures from the same frames I used there (instead of looking at every 100th frame). And here they are, in the same order they appear in the Matlab blog entry (stupid me: not in chronological order).





Of course, this is just the background subtraction, and ignores the findBall aspect. No red crosses in these images. But the subtraction seems to be fairly good, and the extra work I've done (blurring/thresholding) has removed the noise of the table and bush shimmering. It doesn't have some of the drawbacks of my Matlab method, like looking only at increases in intensity.

It still would suffer from requiring good contrast with the background -- I've run it on my second video and had contrast problems. In fact, here is a raw image from a frame in that other video. Where's the ball? Even a human would have a hard time finding it without the context of where the ball was last frame. Yep, it's that little slightly-lighter-gray smudge in the middle. So contrast is still a problem.



Efficiency

Running background subtraction in OpenCV has not proven to be very efficient. With the ~1MP images I'm working with, it was taking much more than 1/60th of a second to process a frame. I'm estimating it was processing about 25 frames per second. This would mean that it can't be run in real time without improvements.

I have some ideas there. The first is using Region of Interest ("ROI") capabilities of a camera, or even on the software side. If you know the thing you are interested in (i.e. the ball) is in a particular region of the picture, only process that region. If ROI is implemented on the camera, that means the camera will only send pixels in that region (which also saves on bandwidth!). I imagine I can find the table and the ball once, then in each incremental frame I could have a very good guess about where to look for the ball. I might only need to process the whole image again if I lose track of the ball.
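
A software-side version of that is easy to sketch in Python. Assume I already have a pixel-coordinate guess for the ball from the previous frame; then only a window around it gets handed to the expensive processing. The window size and the findBallInImage function are placeholders for whatever detector ends up doing the work.

def process_with_roi(frame, last_ball_xy, half_window=80):
    # frame is a numpy image (rows, cols, channels); last_ball_xy is (x, y) or None.
    # findBallInImage is a stand-in for the actual detector.
    h, w = frame.shape[:2]
    if last_ball_xy is None:
        return findBallInImage(frame)          # no guess: slow full-frame search
    x, y = last_ball_xy
    x0, x1 = max(0, x - half_window), min(w, x + half_window)
    y0, y1 = max(0, y - half_window), min(h, y + half_window)
    roi = frame[y0:y1, x0:x1]                  # numpy slicing: a view, not a copy
    hit = findBallInImage(roi)
    if hit is None:
        return None                            # lost the ball; caller should fall back to the full frame
    return (hit[0] + x0, hit[1] + y0)          # translate back to full-image coordinates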

The second idea is to run the update on the background model (the mixture of gaussians) only periodically. Or, better yet, run for a large number of frames with background only. That would parameterize it to recognize the background (leaf, not leaf, etc) without getting confused by a stationary-but-foreground object, like the opponent's body. Then running the model on live action frames might be faster without the update to the model. The OpenCV code doesn't seem organized in a way that makes that possible, but that's what's good about open source: I can change it.
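
For the second idea, the Python bindings make the experiment easy to sketch. I believe the subtractor's apply call takes an optional learningRate argument, where 0 means "classify against the current model but don't update it" -- if that turns out not to be true, patching the source is the fallback. The frame lists below are placeholders, and the constructor spelling (createBackgroundSubtractorMOG2 vs BackgroundSubtractorMOG2) varies between OpenCV versions.

import cv2

bg_sub = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=10, detectShadows=True)

# Phase 1: learn the background from frames with nobody in the shot.
for frame in background_only_frames:        # placeholder: list/iterator of images
    bg_sub.apply(frame)                     # default learning rate: the model updates

# Phase 2: live action. learningRate=0 freezes the model, so a player
# standing still can't slowly get absorbed into the background.
for frame in live_frames:                   # placeholder: list/iterator of images
    fg_mask = bg_sub.apply(frame, learningRate=0)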

Conclusion

I conclude that OpenCV is going to be a great tool, but at the same time BackgroundSubtractorMOG2 may or may not be appropriate for me. It is compute intensive to the point that I can't run this way at full speed. My results in Matlab were almost as good for background subtraction (and could easily be reimplemented in OpenCV).

So in this update I made progress only in that I was able to use my new tool: OpenCV. But at least it is progress.

Wednesday, January 23, 2013

First attempt

This is my first attempt at doing some computer vision work.

The Camera

Since I haven't really decided what I need yet, I used the supplies and tools I had at hand. So I went out and recorded some video on my Canon Rebel T3i. That's an SLR camera -- not a video camera per se, and certainly not a high-performance video camera. But it can shoot 60 fps at 1280x720 (a.k.a. 720p) and save it to a memory card. It cannot stream a live feed in real-time to a computer. It was a bright day for this outdoor shoot, which was in the shade. That, combined with the overall quality of the camera and the lens, meant that the video turned out fairly well. I chose to shoot from behind me and to one side, above the table, which made it easy to keep the table in the frame.

Here is the footage that I'm working from. It has been webified down to 30 fps and compressed -- the original was in a higher quality .mov format at 60 fps.


Extracting Frames

With that video captured, I came back inside and started the processing. It was easy to copy over to my computer, as Ubuntu recognized the camera as a mass storage device, and I copied the .mov file. The file was 67MB. Simple.

I had some trouble getting the replay to work in some Ubuntu players. VLC seemed to do the best job, so that's what I've been using ever since. It plays nicely.

I wanted to extract individual frames from the video, so that I could attempt to identify interesting features in it. To do that, I used the ffmpeg/avconv tool, available in the Ubuntu repository. This is what I did:

ffmpeg -i movie1.mov -r 60 -f image2 pngofmov/image-%03d.png

This created a separate png file for each frame of the movie. With 7.5s of footage, I get about 7.5*60 = 450 frames. Each frame's png is about 1.3MB, for a total of 585MB -- much bigger than the original .mov.

Processing

I used Matlab to do this processing, as a) I have it available, b) I understand how to use it, c) I didn't know any better. Note that I do not own the Image Processing Toolbox, so I'm using the base Matlab package.

The good news is that Matlab makes it easy to load images, display them, and treat them as 3D matrices of numbers. That meant that it took me just a few minutes to get started and under an hour to complete this whole task.

Background Subtraction

I chose to focus on a handful of images from the movie, to see if I could identify the ball in them. To do this, I had the idea of background subtraction: subtracting one image from another to identify the differences. If you subtract an image with only background in it from an image with action on top of the background, you should be left with just the action. Sounds easy.

Seriously, this is how much code this takes in Matlab:
img291 = imread('image-291.png');
imgbase = imread('image-060.png');
imgdiff = img291 - imgbase;
image(imgdiff);
print('-dpng', '291minus060.png');

Well, here is my first attempt.

Background Image

Image With Action

Foreground Through Subtraction
Wow! That actually worked! The ball clearly appears, as does my arm with a paddle. Now, it might be hard to see in this web-sized image, but there is additional noise in the image. There is movement in the bush in the background and even the edges of the table seem to be shimmering enough to cause the odd pixel to light up. But the overall effect is pretty clearly a success.

Finding the Ball

How do I turn that into an algorithm to find a ball? Well, after a little bit of trial and error (and a little googling for how to create a circle mask), I came up with this little function. It searches for the greatest intensity increase that is in the shape of a circle of a particular size.

function circxy = findBall(imgbase, img)
    % do the subtraction
    imgdiff = img - imgbase;
    % average the red,green,blue pixels to get grayscale
    imgdiffgray = mean(imgdiff,3);
    % define a mask/kernel with a 12-pixel radius circle in the middle
    % the kernel has +1 inside the circle and -1 outside the circle
    crad = 12;
    % kernel is roughly 2.5x the radius on each side
    ix = sqrt(2*pi*crad^2); iy = ix;
    cx = ix/2; cy = cx;
    [x,y] = meshgrid(-(cx-1):(ix-cx), -(cy-1):(iy-cy));
    circmask = ((x.^2+y.^2) < crad^2);
    circkern = circmask * 2 - 1;
    % apply the mask to every location in the grayscale difference
    circfind = conv2(imgdiffgray, circkern, 'same');
    % find the location where the kernel fit the best
    % (note: find returns [row, col], i.e. [y, x] in image terms)
    [circx,circy] = find(circfind == max(max(circfind)));
    circxy = [circx,circy];

So how does it do? Let me show you:

Yep, that's a red cross on top of the ball. It found it, and I would argue quite accurately. Here's a zoomed in look.


Yep, that's pretty accurate. Was it a fluke? I ran it with three more images. Here are the results, zoomed in on where it placed the red cross.





Those red crosses are all on the ball.

I left this session feeling pretty good about how I was doing. In fact, I still feel pretty good about how I did. But, looking back on it later, I have a few concerns with this approach, even though it worked on all four of the frames I tried it on.

  • This required an initial background-only image. I'm not sure if that is realistic or not. For now I won't worry about it.
  • My image subtraction approach looks only for where a particular color has increased in intensity. If something gets darker, it is excluded from the difference (it actually gets a negative value in the raw matrix, and Matlab displays this as black). Perhaps an absolute value subtraction would be more appropriate? (See the sketch after this list.)
  • This means that my method is very dependent on the contrast between the ball and the background. If I had a pale background -- say that light gray wall -- the contrast would be lower, and it would stand out less. It's possible that my circle-finding kernel wouldn't choose it as the strongest match.
  • My circle kernel is a predefined radius. Admittedly, I looked at an image or two, and decided it was about 12 pixels in radius. This varies depending on how far the ball is from the camera, and all these test images have the ball a fairly similar distance. (On the plus side, we might be able to use the radius to approximate its distance from the camera!)
  • Motion blur is clearly visible in some of these test images. The ball ceases to be circular, and instead turns into a round-ended rectangle (the path a circle sweeps as it moves). This is most obvious in the last image, where the ball must have had its greatest velocity. It still found the ball in this image, but I suspect it was less certain. (On the plus side, we might be able to use motion blur to estimate the velocity of the ball!)
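
To illustrate the absolute-value idea from the second bullet: in OpenCV it's one call, cv2.absdiff, which keeps changes in both directions instead of clipping darkening to zero the way unsigned subtraction does. The file names are just the frames from this post.

import cv2

img = cv2.imread("image-291.png")
base = cv2.imread("image-060.png")

clipped = cv2.subtract(img, base)    # like my Matlab version: places that got darker are lost
both = cv2.absdiff(img, base)        # darkening and brightening both survive

gray = cv2.cvtColor(both, cv2.COLOR_BGR2GRAY)
cv2.imwrite("291minus060-absdiff.png", gray)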

Epilogue: I have run this algorithm on another set of photos, and it wasn't so successful. Those images had more movement (an actual whole person) and had patches of bright sunshine in the background. The first findBall I tried identified my forehead as the most likely ball in the image.

Thursday, January 17, 2013

Quantity of data

I've been doing some looking at cameras. I'll have plenty to say on the cameras themselves later, but first I want to take a look at the sheer quantity of data that I'm proposing to capture.

Let's say we want 120fps. There seems to be an industry standard to use multiples of 30fps until getting into the many-hundreds. So I'm just rounding up the 100fps I asked for in my previous post.

Let's say we want 1 megapixel. That's a 1024x1024 image, if it was square. It probably won't be square, but that's a good approximation of 1280x720 or similar resolutions. I think this is a decent resolution with which to capture a moving ball without the fancy zooming used by the Ishikawa Oku Lab.

Let's say we want 24 bits per pixel. I'm least confident about this requirement. But that's enough bits to give an RGB image with 8 bits for each color, or 0 to 255 values for each color for each pixel. That's a pretty common image format, I believe, so I think it is reasonable. I'm avoiding greyscale intentionally: I think that the color of the table (blue) vs the lines (white) vs the ball (white right now, but I'm thinking of changing to orange) vs the paddles (red and black) is going to be important.

So what does that add up to, in terms of the quantity of data produced?

120 * 1024 * 1024 * 24 = 3,019,898,880 bits per second ≈ 3 Gigabits per second -- or a round 2.88 Gbps if you treat a megapixel as an even million pixels, which is the number I'll carry forward.

But wait! I intend to use stereo vision, so that's two cameras: 5.76 Gbps.
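
For the record, here's that arithmetic worked out in a few lines of Python, using the nominal link speeds discussed below:

fps = 120
pixels = 1000000                 # calling a megapixel an even million
bits_per_pixel = 24
cameras = 2

per_camera_gbps = fps * pixels * bits_per_pixel / 1e9    # 2.88
total_gbps = per_camera_gbps * cameras                   # 5.76

links_gbps = {"USB 2.0": 0.48, "GigE": 1.0, "USB 3.0": 5.0, "Camera Link Full": 5.44}
for name, capacity in links_gbps.items():
    verdict = "fits" if per_camera_gbps <= capacity else "too slow"
    print(name, "->", verdict, "for one camera at", per_camera_gbps, "Gbps")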

Since I want to process this on a computer, and I need it done in real-time, this much data has to be sent to my computer in a constant stream. I would also have to be able to process this much data, otherwise there is little point in sending it to the computer.

Transmission Medium

Focusing on the transmission first, does this introduce any problems?

USB 2.0, the most prevalent USB format in use today, is 480 Mbps, well below what I need. So something like a USB webcam, if it were to offer the frame rate and resolution I want, wouldn't be able to send all that data to me.

There are three other common formats used by fancy cameras: GigE, USB 3.0, Camera Link.

GigE is just using standard networking. It can run over a standard $5 CAT 6 cable, and can even be switched and piped around using a $30 network switch. The cameras have some built-in electronics to convert their data into UDP to be sent over the network. On the receiving end, you just need a standard GigE network port, and then some software to interpret it. This means that, after the camera itself, there is almost no cost, and that is very appealing. Of course there is a problem with this: bandwidth. GigE is so-named because it can transmit 1 Gbps. Since I'm proposing 2.88 Gbps per camera, it won't all fit on the wire. So GigE is eliminated if I want to stick to my specs (but notice that if I went to 8 bits per pixel -- greyscale -- it would fit!).

USB 3.0 is the newest USB standard. It is still rare, but is starting to be adopted. I believe one of my computers at home supports a single USB 3.0 plug, and I imagine that there are other motherboards out there that would accept two of them. USB is also cheap when it comes to accessories, because it is a standard format. I can buy cables for $5. So far it sounds good -- but what about bandwidth? Well, it's adequate: USB 3.0 is specified to handle 5 Gbps of traffic. So that would comfortably hold a single camera's data. I would need one cable for each camera, and my computer would have to be able to handle them both simultaneously.

Camera Link is designed just for cameras, which sounds promising. It comes in a few different "sizes", which are really just using multiple cables cooperatively. The cables are not very common, really only used by deep-pocket researchers and industry. I seem to find them starting at $200 each. They also will require a card in my computer to receive the signal over the cable, to get it into computer memory. Those are apparently called "frame grabbers" and I haven't found prices on them yet... I expect they are > $500 and possibly into the many thousands. On the plus side, the bandwidth is there. Camera Link Base (one cable) is 2040 Mbps, and Camera Link Full (two cables) is 5.44 Gbps. Yes, I see that two cables is more than double one cable, but they do something fancy to make that happen. So I could fit a single camera on a Full cable. If my requirements were lowered a little, it might fit on a Base cable.

Computer Side

So once I get the data to the computer, how realistic is it to process this information?

Well, first consider that in some cases it will have to move across the PCIe bus from an add-in card. Good news: PCIe 2.0 (which I imagine is most common) can transmit 8 Gbps with 2 lanes, which is not at all hard to find.

So now we can get it to the motherboard itself, and presumably into RAM. DDR3-800 SDRAM can be accessed at 51.2 Gbps, so memory doesn't seem to be a problem. We can't keep it in RAM for any significant length of time, or else we're going to need a lot of RAM. So basically we want to use frames as they come and discard them immediately.

What about CPU speed? If we had a 3.6 GHz 8-core machine (which I don't have at home, but I could if needed), we have a total of 3.6 billion * 8 = 28.8 billion clock cycles per second to use. That's 240 million clock cycles per frame interval at 120 fps. If I wanted to do any operation that requires calculating something for each pixel (any convolution seems to fall into this category), that's about 240 cycles per stereo pixel pair. That's pretty tight for doing anything fancy, so I'm going to need to pay attention to this and look for ways to reduce CPU load.
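
And the cycle-budget arithmetic, for completeness (a back-of-the-envelope number that ignores SIMD, caches, memory stalls, and everything else that actually matters):

clock_hz = 3.6e9
cores = 8
fps = 120
pixels_per_camera = 1000000

cycles_per_second = clock_hz * cores                  # 28.8 billion
cycles_per_frame_time = cycles_per_second / fps       # 240 million per 1/120 s
cycles_per_stereo_pixel = cycles_per_frame_time / pixels_per_camera   # ~240 per left/right pixel pair
print(cycles_per_frame_time, cycles_per_stereo_pixel)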

Conclusion

This exercise has brought me to conclude that if I want 24-bit color 1 megapixel images at 120fps, I'm going to need USB 3.0 (easy and cheap, but new) or Camera Link (expensive and specialized, but established). I'm also going to have to pay close attention to the demands I am placing on the CPU.


Note: source for all the bandwidth numbers was Wikipedia, here.

Wednesday, January 16, 2013

Prior art: zoom on a moving ball

After deciding that the vision phase of the project was going to be my focus, I found some prior art there too. The Ishikawa Oku Laboratory at the University of Tokyo has some impressive stuff they've done.

Here's the brief video of their accomplishments in ping pong:


To summarize, they track fast-moving objects -- and stay zoomed on them -- by using some high-speed servos to move mirrors, instead of moving the camera. That allows them to see detail on the ping-pong ball as it flies through the air. They get enough detail that they can identify the spin on the ball, because they can zoom in so much.

I imagine that with that kind of detail, a robot could make some pretty good "thinking" phase decisions, and play a very competitive game.

So what can I learn from them? Thankfully, like good academics, they have released a paper that provides some details: High-speed Gaze Controller for Millisecond-order Pan/tilt Camera.

I learn that they used some expensive equipment -- at least in the scale of the self-funded researcher. They are using two M2 mirrors from GSI, which are high-quality mirrors on high-speed servos, designed for scanning stuff with lasers. I think I saw them costing $700 each but I can't find the link now; maybe I'm fooling myself. They also use a 1000fps camera which, as I've discovered since, is an expensive toy. I think it is described in more detail in this non-free paper. They also use their own custom fast-acting lens to keep everything in focus at those speeds.

Altogether, this is out of my league... by an even greater margin than the rest of the project. But do I really need all that? I don't think so.

Let's think about the needed frame rate on the camera. I'd like to think of it in terms of how far the ball moves between frames, and try to keep that reasonable. That requires knowing how fast the ball is moving. This site concludes 30 m/s is the upper end for professional players. I did some rough calculations on me playing at a very gentle pace against the playback on my table: 4 m/s. I'm going to use 10 m/s as my target speed. So at 1000 fps, the ball moves 10mm. That's pretty small. Not a lot happens in that time. It's not like the ball is actively powered -- it isn't accelerating itself. There should be predictable forces acting on the ball: gravity, air resistance, spin, bounces off a uniform table. I think we could easily get away with the ball moving 100mm between frames, if we had good precision about our measurements in each frame. So that would mean 100 fps. Any slower and I do worry that we wouldn't have enough observations of the ball in flight to accurately determine its future path. This jives with the Zhejiang group from my previous post, who use 120fps cameras.
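
The displacement-per-frame math is a few lines of Python, in case I want to revisit the 10 m/s assumption later:

ball_speed = 10.0                  # m/s; my working assumption (pros reach ~30 m/s)
for fps in (1000, 120, 100, 60):
    print(fps, "fps ->", round(1000.0 * ball_speed / fps, 1), "mm of ball travel between frames")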

The second strength Ishikawa Oku has is the ability to zoom in on the ball. That would be nice, as it would allow us to determine its position and velocity more accurately by using the same pixels to cover a smaller area. But the complexity of their tracking system just isn't realistic for my first attempts. Ditto with detecting spin (a product of high frame rate, high resolution via zooming, and high-speed focus). I think it is a level of refinement more than I need to get the basic job done.

So this post reviewed some very cool work, but I've decided that it is overkill for my purposes. I'm looking for something around 100 fps that doesn't need any fancy tracking.


As an aside, the Ishikawa Oku Lab is my most favoritist. They seem to do all sorts of awesome robot stuff. Like this high-speed robot hand that I stumbled across a year ago and that still amazes me.

Mission Statement

Yes, you read the title correctly. I'm talking about robots. Robots that play ping pong.

Do I have such a robot? Of course not. But that's the point of this chronicle. I'm going to give it a shot. I'm going to see how far I get. I'm going to see how long it takes to get bored of the project.

So, ideally, what am I trying to accomplish? Well, I want a robot to play ping pong against. I am motivated to build this robot because I like ping pong, but I don't like people. If I want to play ping pong, I'm going to have to build an opponent who will play when I want, at my home, for as long as I want, and then not hassle me about it when I don't want to play. I imagine this robot would have some commercial value, if it plays well enough, but that is not my motivation.

You might wonder about my qualifications. Simple: I have none. Well, not true. I have a ping pong table (which happens to be an outdoor table for the moment) and I have played ping pong (not terribly well). I have no engineering experience, let alone robotics experience. I can do some decent computer programming, but have no experience in anything that applies specifically to this problem. So this is not a tutorial on how to approach such a project. It is a record of how I approached it, as an amateur.

I know that robot ping pong is not an original field of study. In my first hour of investigating prior art, I found this Chinese team at Zhejiang University. Here is their impressive video:


You might think it would discourage me to learn that someone has already accomplished my goal. But I look at it as inspiration: it can be done! It took a group of graduate students who already knew something about robotics, a bunch of money, and a bunch of time. So I'm at a slight disadvantage... but it can be done! There are very few details of their project available, from what I can find, so I don't think there will be much I can borrow. I know their robots are enormous and humanoid (30 motors each, I believe I read), they don't seem to rely on external hardware (like a ceiling mounted camera or an accelerometer in the ball), and they use 120fps video as their main input. They seem to be using standard ping pong equipment, except for the 6 green dots on the table (which may or may not be used by the robots to make their jobs easier).

I see this project as having three barely-related problems: how to observe the game, how to think about the game, and how to take action. Since there is only one of me, I will probably tackle these problems in series, rather than in parallel.

I've chosen to focus on the first problem: observing the game. That means building a computer representation of the physical game in real-time. If I can draw a 3D model of the table and the ball in real-time, and perhaps the paddle of the opponent, I will consider this step a success. It has to be accurate enough to (in later project phases) decide where and how to swing my robot's paddle. And it has to be fast enough -- close enough to real-time -- such that the ball hasn't passed my robot before it acts.

Looking ahead to the "think about the game" phase of the project, I expect to do something minimal at first, as the robots at Zhejiang have. If I can hit simple shots back to the center of the table, that will be good enough. So I don't think there is all that much to the thinking.

The "take action" phase is going to be the most difficult, given my lack of robotics experience and the high cost I expect for the hardware. That's why I'm not doing this phase first. But if I happen to finish the "observe" phase successfully, I should be willing to invest the money and time into finishing the project. I've been meaning to experiment with robotics anyway, so this will be a nice way to get into it.

That's it for an introduction. I've already tried a few things before deciding to retroactively start this blog, so in theory I will post something new about it soon.