1. Lightweight algorithms.
As explained earlier, to make Appliance Reader cost-effective, we wanted it to run on smartphones. At the time of our initial research in 2012, smartphones were quite limited in processing power, so the algorithms used could not be so heavy that they would not run on a smartphone.
2. Work on different appliances.
3. Users should not need to modify their appliances.
Complex modifications, such as changes to an appliance's electrical hardware, should be avoided.
Based on the constraints above, we came up with the following flow for the smartphone application:
1. Selecting an appliance
When the app is opened, it shows a list of supported appliances. The user navigates the list by scrolling a finger over it, and as the finger moves, the voice-over system speaks out the appliance it is touching. To confirm a selection, the user double-taps the screen.
2. Taking a picture
After an appliance is selected, the application enters picture-taking mode, where the camera stream is opened. Depending on how powerful the device running the application is, there are two possible user interfaces. The first, intended for less powerful devices, requires the user to tap the screen when he/she thinks the appliance's digital display is within the camera's field of view. The second, for more powerful devices, does not require a tap; the user only needs to point the camera at the appliance's digital display. The second approach is computationally heavier because a picture needs to be captured and processed continually. In this writing, we use the first approach.
3. Processing the picture
After the user taps the screen, Appliance Reader processes the picture in the background. Meanwhile, the user is asked to wait while a progress bar is shown on the screen.
4. Interpreting the result
If an interpretation can be made from the picture taken in (3), the result is verbalized to the user using a text-to-speech engine. If, however, the interpretation fails, for example because the picture is too blurry or the digital display is cropped off, the user is asked to take the picture again, and we go back to (2).
The General Idea of Our System
The hardest part of the computer vision and machine learning aspects of this project is in their details. The main concepts are fairly simple, but putting them into practice is challenging, as there are many unexpected problems along the way. Instead of describing all the nitty-gritty details immediately, I'll first illustrate only the high-level concepts. The implementation details and challenges are explained in the next section.
Reference Image
After the high-level design of the user interface was decided, we came up with the main idea of this project: to have a reference image for each of the supported appliances. A reference image is an annotated picture of an appliance's digital display in which the locations and semantics of each of the components on the display are known. To illustrate this concept, refer to the following figure. The picture of the thermostat, without any of the annotations (the texts and boxes), is the reference image. The Cartesian coordinates of the four boxes that surround the temperature and humidity are also known. And the semantics of the display are that the temperature can be interpreted from the two boxes on the left, while the humidity comes from the two boxes on the right.
So how is a reference image used? A picture of the appliance taken by the user is transformed to look like the reference image, as shown in the following figure. Homography is used for this transformation. A homography matrix is a 3x3 matrix which, when applied to the user-taken image, transforms it as if it were taken from the reference image's perspective.
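As a minimal sketch of this step, assuming the homography matrix H has already been estimated (more on that in the Technical Details section), the warp itself is a single OpenCV call:

    #include <opencv2/opencv.hpp>

    // Warp the user-taken picture into the reference image's perspective.
    // 'H' (the 3x3 homography) and 'refSize' (the reference image's size)
    // are assumed to have been computed and stored already.
    cv::Mat alignToReference(const cv::Mat& userImage, const cv::Mat& H,
                             const cv::Size& refSize) {
        cv::Mat aligned;
        cv::warpPerspective(userImage, aligned, H, refSize);
        return aligned;
    }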
The purpose of this transformation is to remove the "skew" from the picture taken by our visually impaired user. Without it, things like translation, rotation, scaling, and warping would have to be accommodated when interpreting the digits on the display. With the transformation applied, on the other hand, the locations of the boxes that surround the temperature and humidity digits are immediately known, which reduces the complexity needed for interpreting the display.
With the information provided by the reference image, we have everything needed to interpret the picture taken. First, we crop out the individual regions that make up the digits of the temperature and humidity. Then, we interpret each region using machine learning. Finally, we verbalize the result of the interpretation and speak it out to the user using a text-to-speech engine.
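As a sketch of the cropping step, suppose the reference image's annotations give us one box per digit; the coordinates below are made-up stand-ins for the real annotation data:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Annotated digit boxes from the reference image. These coordinates
    // are hypothetical; real values come from the annotation data.
    std::vector<cv::Rect> digitBoxes = {
        cv::Rect(40, 60, 50, 80),   // temperature, first digit
        cv::Rect(95, 60, 50, 80),   // temperature, second digit
        cv::Rect(180, 60, 50, 80),  // humidity, first digit
        cv::Rect(235, 60, 50, 80),  // humidity, second digit
    };

    // Crop each digit region out of the aligned (transformed) picture so
    // it can be handed to the classifier.
    std::vector<cv::Mat> cropDigits(const cv::Mat& aligned) {
        std::vector<cv::Mat> digits;
        for (const cv::Rect& box : digitBoxes) {
            digits.push_back(aligned(box).clone());
        }
        return digits;
    }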
Technical Details
In this section, I'm going to provide more of the theory, details, and challenges that we came across while working on the project. Although the project is conceptually quite straightforward, the implementation wasn't as direct. Most of the challenges came from the uncertain nature of computer vision and machine learning problems; more concretely, from getting the computer vision and machine learning algorithms to work robustly given the constraints that we had. For these, we had to spend significant effort understanding the root causes of problems, which frequently required us to write separate programs to visually inspect them.
The technologies used are Android smartphones, the OpenCV computer vision library, and deep learning with the Caffe framework. Android was used because it is the mobile platform that we were most familiar with. And as computer vision is a pretty complex field by itself, we used OpenCV, one of the most popular open-source computer vision libraries, so that we wouldn't have to implement many of the algorithms on our own. Furthermore, OpenCV is also supported on multiple platforms, such as Windows, Mac, Linux, Android, and iPhone. Finally, deep learning was used because it had recently been found to work very well on computer vision problems similar to what Appliance Reader was trying to solve; it bested the state-of-the-art approaches on problems such as image recognition.
Instead of immediately prototyping on an Android phone, we mostly worked on a Unix-based desktop. The thinking was that debugging on a desktop is a lot more convenient than on a phone. And to ensure the program would run on the phone, we occasionally ported our progress to it. For example, this one is a program that does homography transformation on Mac OS X, while this one is its Android port. The prototyping was done in different ways. The real-time way was to have an Android phone capture the picture of an appliance, which was then immediately sent to the desktop for processing via a TCP connection. An offline way was to capture a number of pictures with the phone (simulating how real users would use it) and then keep reusing those pictures for different experiments on the desktop.
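For a rough idea of the real-time setup, the desktop-side receiver might look like the sketch below; the 4-byte length-prefixed JPEG wire format here is an assumption for illustration, not a record of our actual protocol:

    #include <opencv2/opencv.hpp>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <cstdint>
    #include <vector>

    // Receive one length-prefixed JPEG frame from a connected socket and
    // decode it into a cv::Mat. Returns an empty Mat on failure.
    cv::Mat receiveFrame(int sock) {
        uint32_t lenNet = 0;
        if (recv(sock, &lenNet, sizeof(lenNet), MSG_WAITALL) !=
                (ssize_t)sizeof(lenNet))
            return cv::Mat();
        std::vector<unsigned char> buf(ntohl(lenNet));
        if (recv(sock, buf.data(), buf.size(), MSG_WAITALL) !=
                (ssize_t)buf.size())
            return cv::Mat();
        return cv::imdecode(buf, cv::IMREAD_COLOR);
    }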
Homography Transformation
To transform the user-taken image to look like the reference image, we use a homography transformation, also known as a projective transformation. A homography transformation is done by applying a particular 3x3 matrix to the user-taken image. Compared to the well-known affine transformation, a homography has two more degrees of freedom, for a total of 8. Since homography transformations have been used successfully in 3D reconstruction problems, such as reconstructing a 3D model from 2D images and projecting 3D points onto 2D space, this transformation is naturally a good candidate for our problem.
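To make the mechanics concrete: a point (x, y) is lifted to the homogeneous coordinate (x, y, 1), multiplied by H, and the result is divided by its third component. That final division is also why H is only defined up to scale, leaving 8 rather than 9 degrees of freedom. A small sketch of this mapping (OpenCV's cv::perspectiveTransform does the same for arrays of points):

    #include <opencv2/opencv.hpp>

    // Map a single point through a 3x3 homography "by hand": lift (x, y)
    // to homogeneous coordinates (x, y, 1), multiply by H, then divide by
    // the third component of the result.
    cv::Point2f mapPoint(const cv::Mat& H, const cv::Point2f& p) {
        cv::Mat v = (cv::Mat_<double>(3, 1) << p.x, p.y, 1.0);
        cv::Mat r = H * v;  // H is assumed to be CV_64F (double precision)
        double w = r.at<double>(2, 0);
        return cv::Point2f((float)(r.at<double>(0, 0) / w),
                           (float)(r.at<double>(1, 0) / w));
    }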
This transformation is used to reduce the complexity of the interpretation (machine learning) problem, in both the training and classification phases. To illustrate this, consider taking a picture of an appliance and then using it for interpretation. The size of the digital display in the picture depends on the distance between the camera and the appliance: the shorter the distance, the bigger the appliance appears, and so do the symbols on the display that need to be interpreted by machine learning. Without the transformation, classification would need to be done across different scales, which increases the complexity of the interpretation. In addition, the complexity of training the machine learning model would also increase, as the model would need to deal with more varieties of negative samples across different scales. Other factors, such as rotation, translation, and skew, would similarly need to be considered.
1. Correspondences
Given a picture of an appliance that is to be transformed, we first need to compute its correspondences with the reference image. Correspondences describe how points on the picture map onto the reference image; essentially, they form a 2D-to-2D mapping. These correspondences are then used to compute the aforementioned homography matrix. Math-wise, only four correspondences are needed to determine the matrix.
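As an illustration of the four-correspondence case, OpenCV's cv::getPerspectiveTransform computes the matrix directly from four point pairs; the coordinates below are hypothetical:

    #include <opencv2/opencv.hpp>

    // Four correspondences pin down the 8 degrees of freedom exactly
    // (each point pair contributes two equations).
    cv::Mat homographyFromFourPoints() {
        cv::Point2f src[4] = { {10, 15}, {300, 20}, {310, 200}, {15, 210} };  // user-taken image
        cv::Point2f dst[4] = { {0, 0},   {320, 0},  {320, 240}, {0, 240} };   // reference image
        return cv::getPerspectiveTransform(src, dst);  // 3x3 CV_64F homography
    }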
2. SIFT
While we can compute correspondences by hand fairly easily, how do we compute them programmatically? We computed SIFT features for both the picture and the reference image and used the RANSAC estimation algorithm to approximate correspondences from those features. Features are points that are considered the "key features" of an image. A common example is an edge, and to illustrate why an edge is interesting, consider a triangle and a rectangle: a triangle is uniquely characterized by its three edges, and a rectangle similarly by its four. SIFT features have proven to work very well, mainly due to their invariance to transformations such as translation, rotation, and scaling, and their partial invariance to geometric distortion and illumination. This means the same features will be extracted (with a certain confidence) from versions of an image to which combinations of the aforementioned transformations have been applied. As for Appliance Reader, the reference image and the user-taken image are both of the same appliance, so the expectation is that the SIFT extraction algorithm will extract the same features from both.
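Computing the features is only a couple of OpenCV calls. The sketch below targets current OpenCV (4.4+), where SIFT lives in the main features2d module; at the time of the project it was in the nonfree module instead:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Extract SIFT keypoints and their descriptors from an image. This is
    // run once on the reference image and once on the user-taken image.
    void extractSift(const cv::Mat& image,
                     std::vector<cv::KeyPoint>& keypoints, cv::Mat& descriptors) {
        cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
        sift->detectAndCompute(image, cv::noArray(), keypoints, descriptors);
    }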
3. RANSAC
Now that we have features computed from the reference image and the user-taken image, we find matches between the two sets of features. For this, we used the "knnMatch" algorithm, which finds the k best matches for each feature. These matches are not accurate, because they are estimated purely from the mathematical distances between the features. RANSAC is then used to compute the correspondences. It iterates on randomly picking a subset of the matches, computing a homography matrix from them, and testing the accuracy of that matrix, in order to find a good homography. How good a matrix is, is determined by the number of matches that agree with the transformation: the more there are, the better the matrix.
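OpenCV packages this RANSAC estimation as cv::findHomography. A sketch, where the 3.0-pixel reprojection threshold and the minimum inlier count of 10 are illustrative defaults rather than values we tuned:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Estimate the homography from the (noisy) feature matches with RANSAC.
    // 'userPts' and 'refPts' are the matched keypoint locations from the
    // user-taken and reference images.
    cv::Mat estimateHomography(const std::vector<cv::Point2f>& userPts,
                               const std::vector<cv::Point2f>& refPts) {
        std::vector<unsigned char> inlierMask;
        cv::Mat H = cv::findHomography(userPts, refPts, cv::RANSAC, 3.0, inlierMask);
        // The inlier count is the "number of matches that agree" with H;
        // too few inliers means the estimate should not be trusted.
        if (H.empty() || cv::countNonZero(inlierMask) < 10)
            return cv::Mat();
        return H;
    }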
Challenges:
a. Some appliances were so plain in appearance that the number of features computed was insufficient to get a good homography.
To solve this problem, we put "interesting" stickers on the appliances. For example, the tiger and elephant stickers on the following appliance:
b. Certain appliances/backgrounds yield too many "similar" features, making the combination of feature matching (the "knnMatch" mentioned before) and RANSAC estimation ineffective.
Since the RANSAC step relies on the assumption that the feature matching produces "good" matches, the pipeline didn't work well on appliances/backgrounds with too many similar features. Some examples are carpets and speakers with lots of small holes.
To alleviate this problem, we ended up using one of the tricks mentioned in David Lowe's original paper on SIFT: for each feature, compute the two best matches instead of just one (hence "knnMatch"), and use the better one in the subsequent RANSAC step only if the two best matches are far apart. The key intuition here is that matches that are not "unique" enough are not good matches.
// Lowe's ratio test: keep a match only if it is clearly better than the
// runner-up, i.e., the best distance is below 'ratio' times the second best.
void CVHelper::featureMatch(const Mat& queryDesc, const Mat& trainDesc,
                            vector<DMatch>& matches) {
    BFMatcher matcher;
    vector<vector<DMatch> > tempMatches;
    matcher.knnMatch(queryDesc, trainDesc, tempMatches, 2);
    const float ratio = 0.8f;  // As in Lowe's paper; can be tuned
    for (size_t i = 0; i < tempMatches.size(); ++i) {
        if (tempMatches[i].size() < 2)
            continue;  // not enough candidates to apply the ratio test
        if (tempMatches[i][0].distance < ratio * tempMatches[i][1].distance) {
            matches.push_back(tempMatches[i][0]);
        }
    }
}
c. How do we check that the homography transformation was successful?
After applying the tricks above, we had a pretty decent success rate transforming the user-taken picture to look like the reference image. Nevertheless, the transformation still sometimes failed, for example when the picture was too blurry (e.g., the phone shook while taking it) or the illumination was really bad. So the question is: how do we know when another picture needs to be taken?
One idea is to use the inverse transformation: after the user-taken picture is transformed, we can transform it back to look like the original and programmatically check whether it is close enough to the original.
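A sketch of that idea, using a deliberately naive per-pixel difference; the metric and threshold are placeholders rather than what a production version should use:

    #include <opencv2/opencv.hpp>

    // Round-trip check: warp the aligned picture back with H^-1 and
    // compare it against the original user-taken picture.
    bool roundTripLooksOk(const cv::Mat& original, const cv::Mat& aligned,
                          const cv::Mat& H) {
        cv::Mat back;
        cv::warpPerspective(aligned, back, H.inv(), original.size());
        double meanDiff = cv::norm(original, back, cv::NORM_L1) /
                          (double)(original.total() * original.channels());
        return meanDiff < 20.0;  // per-pixel threshold, made up for illustration
    }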
Interpretation
TODO: finish this....