Wednesday, October 7, 2015

Appliance Reader


Background

What is Appliance Reader?
As a blind person, Jane relies on braille stickers to use her appliances: oven, stove, microwave, and laundry machines. She puts a sticker on each of an appliance's buttons to identify its function. However, stickers can't convey the appliance's feedback, such as how many more minutes until the cookies are baked or the laundry is done, so she has to specifically buy appliances with physical feedback, such as a laundry machine that is controlled by a dial-based timer. Appliances of this type are becoming obsolete and hard to find as digital display-based ones become ubiquitous. The Appliance Reader project aims to help people like Jane read digital displays.

What device does Appliance Reader run on?
People with visual impairments often rely on specialized hardware to do certain tasks. For example, gadgets like BrailleNote [1] are commonly used to take notes because they provide convenient and accessible ways of typing and reading. However, such specialized gadgets are often very expensive because they serve a relatively small niche market. At the time of this writing, a BrailleNote costs up to a few thousand dollars. Smartphones, on the other hand, are much more affordable because they are mass-produced, and they are becoming a popular substitute for such specialized gadgets; features such as text-to-speech, voice over, and optical character recognition have made smartphones accessible to people with visual impairments. Appliance Reader is a smartphone application that uses the phone's camera to read an appliance's digital display. The user takes pictures or a video stream of the display, and the phone interprets the reading and speaks it out loud.

Who worked on this project?
This project started at Prof. Richard Ladner's Mobile Accessibility lab, one of the research groups at the University of Washington Computer Science and Engineering department. The people who originally worked on this project were Prof. Richard Ladner (CSE Professor at UW), Bryan Rusell (Computer Vision Researcher at UW), Antonius Denny Harijanto (ex-undergrad student), and Michael Hotan (ex-undergrad student). During our undergraduate years, Mike and I started working on this project from scratch, with close guidance from Prof. Ladner and Bryan. Bryan taught us a lot of the computer vision concepts that were necessary for this.

We worked on this from 2012 until Mike and I graduated in 2013. Although we made significant progress in turning what was initially only an idea into partially working implementations, we weren't able to fully finish the project by the time we graduated. The machine learning portion of the project, used in the interpretation phase, was the only significant piece left to be completed.

As an effort to study machine learning, I continued working on this project in my spare time in 2015. As I feel that the project is now very close to completion, I'm writing this post to share what I've learned. I'm hoping that this post will be beneficial! :)


Challenges

First thought.
When we first started the project, we thought that the problem was pretty straightforward: read, interpret, and then present. More concretely, we were thinking of reading an appliance's digital display by running a publicly available OCR engine over the pictures taken with the camera, interpreting the results as text using appliance-specific rules, and then reading the text out loud using a text-to-speech engine. However, the problem turned out to be much more complex, so we decided to create a system that uses computer vision and machine learning to solve it.

What's the problem with OCR?
After we experimented with Tesseract, which was considered to be one of the most accurate OCR engines publicly available at the time, we realized that OCR would not work well for our purposes. From our research and experiments we came up with several hypotheses that explain this.

First, an OCR engine is generally designed to read scanned text documents, so it makes use of common characteristics of such documents: text alignment, uniformity in background color (methods used by top scorer of ICDAR 2015 competition [3]), and uniformity in font sizes across the document. As an example, [X from paper Y][1] proposes a method to retrieve bounding boxes of text regions by scanning through a document with a downward-moving horizontal line and measuring the density of the pixels that the line crosses at each position. These densities are then used to predict the locations of the text. This heuristic sounds very reasonable for documents with lots of properly aligned text, but not for appliances' digital displays, which usually consist of only a handful of characters.

Secondly, appliances' digital displays often use uncommon fonts and non-alphanumeric symbols. Since an OCR engine would need additional training to support these anyway, designing our own trainable system would give us more flexibility and control.

Thirdly, pictures of digital displays captured by people with visual impairments are likely to be skewed, since it is hard for them to angle the camera perfectly. OCR doesn't work well on skewed pictures because it is designed to interpret documents with good perspective (e.g. scanned documents).

Fourthly, since the pictures are taken with smartphone cameras, they vary widely in lighting and may suffer from reflections off the displays' glossy surfaces. Once again, these problems do not arise in the kind of documents that OCR is usually run on.

In our experiments, it was hard to get good OCR results even on carefully taken pictures that were shot perpendicular to the display's plane, had good lighting, and had no glare.


Here are some appliance pictures that illustrate the issues described:


Notice how bad the reflections are, and that "Energy Saver" and "Cubed" are indicated by unusual symbols that an off-the-shelf OCR engine wouldn't recognize.


Notice the rotation in the picture and how the font sizes are not uniform.

[TODO: Put link to OCR experiment]


High-Level Concepts of Our System

Constraints
After we decided that OCR was not a good choice for our purpose, we started designing our own system that uses computer vision and machine learning. We started by deciding the system's constraints, which follow from the specification of Appliance Reader: the hardware that the application runs on, the desired user interface, and the general characteristics of appliances' digital displays. These are the constraints we came up with:

1. Lightweight algorithms.
As explained earlier, in order to make Appliance Reader cost-effective, we wanted it to run on smartphones. At the time of the initial research, in 2012, smartphones were pretty limited in terms of processing power; therefore, the algorithms couldn't be so heavy that they wouldn't run on smartphones.

2. Work on different appliances.
One of the goals of Appliance Reader was to support as many appliances as possible. This meant that the system needed to be extensible; adding support for a new appliance should not require too much effort.

3. User should not need to modify their appliances.
Complex modifications, such as changes to an appliance's electrical hardware, should be avoided.

User-Interface Design
Based on the constraints above, we came up with the following flow of a smartphone application:

1. Selecting an appliance
When the app is first opened, it shows a list of supported appliances. The user navigates through the appliances by sliding a finger over the list, and as the finger moves, the voice-over system speaks out the appliance that the finger is touching. To confirm a selection, the user double-taps the screen.

2. Taking a picture
After an appliance is selected, the application goes into picture-taking mode, where the camera stream is opened. Depending on how powerful the device running the application is, there are two possible user interfaces. The first, intended for less powerful devices, requires the user to tap the screen when he/she thinks that the appliance's digital display is within the camera's field of view. The second, for more powerful devices, doesn't require a tap; the user only needs to point the camera at the appliance's digital display. The second approach is computationally heavier because pictures need to be taken and processed continually. In this writing, we use the first approach.

3. Processing the picture
After the user taps the screen, Appliance Reader processes the picture in the background. Meanwhile, the user is asked to wait while a progress bar is shown on the screen.

4. Interpreting the result
If the processing in (3) yields an interpretation of the picture, the result is read out to the user using a text-to-speech engine. If, however, the interpretation fails, for example because the picture is too blurry or the digital display is cropped off, the user is asked to take the picture again, and we go back to (2).

The General Idea of Our System
The hardest part of the computer vision and machine learning aspects of this project is in the details. The main concepts are fairly simple, but putting them into practice is pretty challenging, as there are many unexpected problems along the way. Instead of describing all the nitty-gritty details immediately, I'll first illustrate only the high-level concepts. The implementation details and challenges will be explained in the next section.

Reference Image
After the high-level design of the user interface was decided, we came up with the main idea of this project: having a reference image for each of the supported appliances. A reference image is an annotated picture of an appliance's digital display in which the locations and semantics of each of the components in the display are known. To illustrate this concept, refer to the following figure. The picture of the thermostat without any of the annotations (the texts and boxes) is the reference image. The Cartesian coordinates of the four boxes that surround the temperature and humidity are also known. The semantics of the display are that the temperature is read from the two boxes on the left, and the humidity from the two boxes on the right.




So how is a reference image used? A picture of the appliance that the user takes is transformed to look like the reference image, as shown in the following figure. A homography is used for this transformation. The homography matrix is a 3x3 matrix which, when applied to the user-taken image, transforms it as if it had been taken from the reference image's perspective.


The purpose of this transformation is to remove the "skew" from the picture taken by our visually impaired user. Without it, things like translation, rotation, scaling, and warping would have to be accommodated when interpreting the digits on the display. By applying the transformation, on the other hand, the locations of the boxes that surround the temperature and humidity digits are immediately known, which reduces the complexity of interpreting the display.

With the information provided by the reference image, we have everything needed to interpret the picture taken. First, we crop out the individual regions that make up the digits of the temperature and humidity. After that, we interpret each of the regions using machine learning. And finally, we turn the result of the interpretation into words and speak it out to the user using a text-to-speech engine.
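The last step of this flow can be sketched in a few lines. This is only an illustration: the `Region` struct, its fields, and the `describe` function are hypothetical names made up for this post, and the per-box digits are assumed to come from the machine learning step.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One annotated box on the reference image (hypothetical layout).
struct Region {
    int x, y, width, height;  // pixel rectangle in reference-image coordinates
    std::string field;        // reading the box belongs to, e.g. "temperature"
};

// Joins the digit recognized in each box into one phrase per field,
// following the order of the annotations.
std::string describe(const std::vector<Region>& regions,
                     const std::vector<int>& digits) {
    std::string out;
    std::string current;
    for (std::size_t i = 0; i < regions.size() && i < digits.size(); ++i) {
        if (regions[i].field != current) {
            // A new field starts: separate it from the previous one.
            if (!out.empty()) out += ", ";
            current = regions[i].field;
            out += current + " ";
        }
        out += std::to_string(digits[i]);
    }
    return out;
}
```

For the thermostat example, four regions labeled temperature, temperature, humidity, humidity with recognized digits 7, 2, 4, 5 would give "temperature 72, humidity 45", which is the string handed to the text-to-speech engine.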





Technical Details

In this section, I'm going to provide more of the theory, details, and challenges that we came across when we worked on the project. Although the project is conceptually pretty straightforward, the implementation wasn't as direct. Most of the challenges came from the uncertain nature of computer vision and machine learning problems. More concretely, they were about getting the computer vision and machine learning algorithms to work robustly given the constraints that we had. For these, we had to spend significant effort trying to understand the root causes of the problems, which frequently required us to write separate programs to inspect the problems visually.

The technologies used are Android smartphones, the OpenCV computer vision library, and deep learning with the Caffe framework. Android was used because it is the mobile programming platform that we were the most familiar with. As computer vision is a pretty complex field by itself, we used OpenCV, which was one of the most popular open-source computer vision libraries, so that we wouldn't have to implement many of the algorithms on our own. Furthermore, OpenCV was supported on different platforms such as Windows, Mac, Linux, Android, and iPhone. Finally, deep learning was used because it had recently been found to work really well on computer vision problems similar to what Appliance Reader was trying to solve. Deep learning outperformed the previous state-of-the-art approaches on problems such as image recognition.

Instead of immediately prototyping on an Android phone, we mostly worked on a Unix-based desktop. The thought was that debugging on a desktop is a lot more convenient than on a phone. To ensure that the program would run on the phone, we occasionally ported our progress onto it. For example, this one is a program that does homography transformation on Mac OSX, while this one is the Android port of it. The prototyping was done in different ways. The real-time way was to have an Android phone capture a picture of an appliance and immediately send it to the desktop for processing over a TCP connection. The offline way was to capture a number of pictures with the phone, simulating how real users would use it, and then reuse those pictures for different experiments on the desktop.

Homography transformation
To transform the user-taken image to look like the reference image, we use a homography transformation, which is also known as a projective transformation. It is done by applying a particular 3x3 matrix to the user-taken image. Compared to the well-known affine transformation, which has 6 degrees of freedom, a homography has two more, for 8 degrees of freedom in total. Since homography transformations have been successfully used in 3D reconstruction problems, such as reconstructing a 3D model from 2D images and projecting 3D points onto 2D space, this transformation is naturally a good candidate to solve our problem.

This transformation is used to reduce the complexity of the interpretation (machine learning) problem, in both the training and classification phases. To illustrate this, consider taking a picture of an appliance and then using it for interpretation. The size of the digital display in the picture depends on the distance of the camera from the appliance. When the distance is shorter, the appliance appears bigger, and so do the symbols on the display that need to be interpreted using machine learning. Without the transformation, the classification would need to be done across different scales, which increases the complexity of the interpretation. The complexity of training the machine learning model would also increase, as the model would need to deal with a wider variety of negative samples at different scales. Other factors such as rotation, translation, and skew would need to be handled in the same way.

1. Correspondences
Given a picture of an appliance that is to be transformed, we first need to compute its correspondences with the reference image. Correspondences describe how points on the picture map onto the reference image, so they are essentially a 2D -> 2D mapping. These correspondences are then used to solve for the aforementioned homography matrix. Math-wise, only four correspondences are needed to determine the matrix.
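To see why four is enough, one can count equations against unknowns. The notation below is introduced here only for illustration: h_ij are the entries of the homography matrix, (x, y) is a point in the picture, and (x', y') is its match in the reference image.

```latex
% A homography maps (x, y) to (x', y') through a division by a common denominator:
\[
x' = \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}, \qquad
y' = \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}}
\]
% Multiplying out gives two equations that are linear in the entries h_{ij}:
\[
h_{11}x + h_{12}y + h_{13} - x'\,(h_{31}x + h_{32}y + h_{33}) = 0
\]
\[
h_{21}x + h_{22}y + h_{23} - y'\,(h_{31}x + h_{32}y + h_{33}) = 0
\]
% Scaling all nine entries by the same factor leaves x' and y' unchanged,
% so only 8 of the 9 entries are independent. Each correspondence supplies 2 equations:
\[
4 \text{ correspondences} \times 2 \text{ equations} = 8 \text{ equations for } 8 \text{ unknowns}
\]
```

This count assumes that no three of the four points lie on one line; otherwise the equations are no longer independent and the matrix is not uniquely determined.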

2. SIFT
While we can compute correspondences by hand fairly easily, how do we compute them programmatically? We computed SIFT features for both the picture and the reference image and used the RANSAC estimation algorithm to approximate correspondences from those features. Features are basically points that are considered the "key features" of an image. A common example is an edge, and to illustrate why an edge is interesting, consider a triangle and a rectangle. A triangle is uniquely characterized by its three edges, and a rectangle similarly by its four edges. SIFT features have been proven to work very well, mainly due to their invariance to transformations such as translation, rotation, and scaling, and their partial invariance to geometric distortion and changes in illumination. This means that the same features would be extracted (to a certain confidence) from versions of an image to which combinations of these transformations have been applied. For Appliance Reader, the reference image and the user-taken image are both of the same appliance, so the expectation is that the SIFT extraction algorithm finds the same features in both.

3. RANSAC
Now that we have features computed from the reference image and the user-taken image, we find the matches between the two sets of features. For this, we used the "knnMatch" algorithm, which finds the k best matches for each feature. These matches are not fully accurate because they are estimated only from the mathematical distances between the features. RANSAC is then used to compute the correspondences. It repeatedly picks a random subset of the matches, computes a homography matrix from them, and tests the accuracy of that matrix, in order to find a good homography matrix. How good a matrix is, is determined by the number of matches that agree with its transformation: the more matches agree, the better the matrix.
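The loop described above can be sketched in plain C++ as follows. This is a simplified illustration written for this post, not the project's code (in practice a library routine such as OpenCV's findHomography with its RANSAC option would normally be used). All names, the pixel tolerance, and the choice of fixing the bottom-right matrix entry to 1 are assumptions made for the sketch.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

struct Pt { double x, y; };
struct Match { Pt src, dst; };       // src: user-taken image, dst: reference image
using Mat3 = std::array<double, 9>;  // row-major 3x3 homography

// Maps p through H in homogeneous coordinates.
Pt project(const Mat3& H, const Pt& p) {
    const double w = H[6] * p.x + H[7] * p.y + H[8];
    return {(H[0] * p.x + H[1] * p.y + H[2]) / w,
            (H[3] * p.x + H[4] * p.y + H[5]) / w};
}

// Solves for H (with H[8] fixed to 1, an assumption of this sketch) from
// exactly four matches by Gauss-Jordan elimination with partial pivoting.
// Returns false when the sample is degenerate.
bool fitHomography(const std::array<Match, 4>& m, Mat3& H) {
    double A[8][9] = {};
    for (int i = 0; i < 4; ++i) {
        const Pt s = m[i].src;
        const Pt d = m[i].dst;
        const double r0[9] = {s.x, s.y, 1.0, 0.0, 0.0, 0.0, -d.x * s.x, -d.x * s.y, d.x};
        const double r1[9] = {0.0, 0.0, 0.0, s.x, s.y, 1.0, -d.y * s.x, -d.y * s.y, d.y};
        for (int c = 0; c < 9; ++c) {
            A[2 * i][c] = r0[c];
            A[2 * i + 1][c] = r1[c];
        }
    }
    for (int col = 0; col < 8; ++col) {
        int pivot = col;  // pick the row with the largest entry in this column
        for (int r = col + 1; r < 8; ++r) {
            if (std::fabs(A[r][col]) > std::fabs(A[pivot][col])) pivot = r;
        }
        if (std::fabs(A[pivot][col]) < 1e-9) return false;
        for (int c = 0; c < 9; ++c) std::swap(A[col][c], A[pivot][c]);
        for (int r = 0; r < 8; ++r) {
            if (r == col) continue;
            const double f = A[r][col] / A[col][col];
            for (int c = col; c < 9; ++c) A[r][c] -= f * A[col][c];
        }
    }
    for (int i = 0; i < 8; ++i) H[i] = A[i][8] / A[i][i];
    H[8] = 1.0;
    return true;
}

// Counts the matches whose projected source lands within tol pixels of its destination.
std::size_t countInliers(const Mat3& H, const std::vector<Match>& ms, double tol) {
    std::size_t n = 0;
    for (const Match& m : ms) {
        const Pt q = project(H, m.src);
        if (std::hypot(q.x - m.dst.x, q.y - m.dst.y) < tol) ++n;
    }
    return n;
}

// Repeatedly fits H to four random matches and keeps the H that the most
// matches agree with. Returns that agreement count (0 if nothing was found).
std::size_t ransacHomography(const std::vector<Match>& ms, int iterations,
                             double tol, unsigned seed, Mat3& best) {
    std::size_t bestCount = 0;
    if (ms.size() < 4) return 0;
    std::mt19937 rng(seed);
    std::vector<std::size_t> idx(ms.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    for (int it = 0; it < iterations; ++it) {
        std::shuffle(idx.begin(), idx.end(), rng);  // first four entries form the sample
        const std::array<Match, 4> sample = {ms[idx[0]], ms[idx[1]], ms[idx[2]], ms[idx[3]]};
        Mat3 H;
        if (!fitHomography(sample, H)) continue;
        const std::size_t count = countInliers(H, ms, tol);
        if (count > bestCount) {
            bestCount = count;
            best = H;
        }
    }
    return bestCount;
}
```

The important property is that a sample containing a wrong match produces a matrix that few other matches agree with, so it loses to a sample made only of correct matches. Production implementations add refinements that this sketch leaves out, such as re-estimating the matrix from all agreeing matches at the end.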

Challenges:
a. Some appliances were so simple in appearance that the number of features computed was insufficient to get a good homography result.

To solve this problem, we put "interesting" stickers on the appliances. For example, the tiger and elephant stickers on the following appliance:



b. Certain appliances/backgrounds have too many "similar" features computed out of them, making the combination of feature matching ("knnMatch" mentioned before) and RANSAC estimation ineffective.

Since the RANSAC step relies on the assumption that there are "good" matches coming out of the feature matching, the pipeline didn't work well on appliances/backgrounds that have too many "similar" features. Some examples are carpets, and speakers that have lots of small holes.

To alleviate this problem, we ended up using one of the tricks mentioned in David Lowe's original paper about SIFT. The trick is, for each feature, to compute the two best matches instead of just one (hence, we used "knnMatch"). Only if the best match is clearly closer than the second-best one, that is, the two distances are far apart, do we use it in the subsequent RANSAC step. The key intuition here is that matches that are not "unique" enough are not good matches.

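// Keeps only the matches that pass Lowe's ratio test.
void CVHelper::featureMatch(const Mat& queryDesc, const Mat& trainDesc, vector<DMatch>& matches) {
    BFMatcher matcher;
    vector<vector<DMatch> > tempMatches;
    // Find the two nearest neighbors for each query descriptor.
    matcher.knnMatch(queryDesc, trainDesc, tempMatches, 2);
    const float ratio = 0.8f; // As in Lowe's paper; can be tuned
    for (size_t i = 0; i < tempMatches.size(); ++i) {
        // knnMatch may return fewer than two neighbors; skip those entries.
        if (tempMatches[i].size() < 2) continue;
        // Accept the best match only if it is clearly closer than the runner-up.
        if (tempMatches[i][0].distance < ratio * tempMatches[i][1].distance) {
            matches.push_back(tempMatches[i][0]);
        }
    }
}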


c. How to check that the homography transformation was successful?

After applying the tricks above, we had a pretty decent success rate in transforming the user-taken picture to look like the reference image. Nevertheless, the transformation still sometimes failed, for example because the picture was too blurry (e.g. the phone shook while taking the picture) or the illumination was really bad. So the question is, how do we know when we need to take another picture?

One idea is to use the inverse transformation. After the user-taken picture is transformed, we can try to transform it back to look like the original and programmatically check whether it is close enough to the original.

Interpretation



TODO: finish this....
