Saturday, October 24, 2015

Understanding how Ubuntu Touch is forked out of AOSP

This post is meant for readers who are familiar with building AOSP. Since I just got my new Nexus 5X, I decided, as a side project, to try porting Ubuntu Touch to my old phone, an HTC One M7.

First off, Ubuntu Touch is built on top of the AOSP project. It uses a fork of the Android kernel, frameworks, and build system. More precisely, the diagram below explains the architecture:

(note that I didn't draw the diagram, credit goes to Ubuntu's official page: https://developer.ubuntu.com/en/start/ubuntu-for-devices/porting-new-device/)

In the past, the Ubuntu Touch project based its Android code on Cyanogen's fork of AOSP (http://cyanogenmod.org); however, it recently moved to a pure AOSP version. To be precise, at the time of this writing, they're using version 4.4.2.r1. As for my device, since there's Cyanogen support for it, I think Ubuntu Touch can be ported with reasonable effort.

I think using AOSP instead of building the project from scratch is a really smart move. I can imagine that not only will it speed up their development process, but they may also benefit from the improvements that Google periodically makes to the code. And the neat thing about the AOSP code is that the Java portion of it is pretty well isolated from the build system, making it a perfect fit for Ubuntu Touch, which doesn't rely on Java.

Before we begin exploring the changes that Ubuntu Touch made on top of AOSP, I'd like to give a crash introduction to AOSP. First, it is built on top of a modified Linux kernel. Two substantial modifications worth mentioning are wakelock support and binder IPC. Wakelocks give userspace a way to prevent Linux from entering the deep-sleep state, or LP0. Unlike on a desktop, when a mobile device is put to sleep (e.g. the power button is pressed), we may not want it to go into deep sleep right away. Sometimes we want the sleep to be delayed until a certain task is completed (e.g. until a data synchronization finishes or an email is sent). Binder is a lightweight IPC mechanism that is used all over Android. It provides the security and flexibility that mobile applications need. For example, when an IPC call occurs, the caller's identity can be retrieved, which in turn allows an application to check whether the caller has the permission required for the request. In terms of flexibility, Google has built a very convenient userspace framework on top of the kernel driver. This framework allows IPC between Java applications, between native applications, and even between Java and native applications. From what little I've observed so far, Ubuntu Touch also seems to rely on wakelocks and binder.

Now, let's go back to the changes that Ubuntu Touch made to AOSP. When the Android kernel completes its initialization, it starts the very first process, init. This process then does all the userspace initialization, such as creating sysfs nodes, setting proper permissions, and kick-starting different daemons. What's important to know here is that init is the one that starts the Java virtual machine that Android relies on heavily: init forks itself and runs zygote, which handles everything Java-related. Since Ubuntu Touch does not use Java, a modification was made so that zygote never runs. For more details, refer to the code under phablet/system/core. In particular, look for the following commit:

commit 1742098c17be31d968dce45a8eda4552602398c7
Author: Ricardo Salveti de Araujo <ricardo.salveti@canonical.com>
Date:   Fri Jan 10 03:06:09 2014 -0200

    init.rc: adding sensorservice and ubuntuappmanager
    
    Also removing some unused services.
    
    Change-Id: I7881bf436319ed09cbac46450c51710b6cf93c11
    Signed-off-by: Ricardo Salveti de Araujo <ricardo.salveti@canonical.com>

[To be continued...]

1. upstart -> ./ubuntu/upstart-property-watcher/upstart-property-watcher.c:78:#define
http://upstart.ubuntu.com/

2. 

Friday, October 23, 2015

CUDNN Benchmark Tools

I was trying to evaluate whether my embedded device, an NVIDIA Shield TV, was capable of running a particular CNN learning algorithm in real time. Instead of directly implementing the entire learning algorithm, with the risk of not meeting the targeted performance, I decided to study the feasibility first by writing a benchmark tool. It runs forward convolutions with configurable parameters (number of feature maps, batch size, filter size) and algorithms.

For those who are not familiar with the CUDNN library, it supports different algorithms for running a forward convolution; some try to minimize the GPU memory footprint, while others focus on performance without worrying about memory footprint. Perhaps the idea is to support GPUs with different specifications (number of CUDA cores, memory size, etc.), as well as different use cases.

Sample output:
n, w, h, c, k, filter_dim, avg_time(us), max_time(us)
1, 3840, 2160, 1, 1, 3, 74383, 75142
1, 3840, 2160, 1, 1, 5, 88465, 88819
1, 3840, 2160, 1, 1, 9, 159752, 160324
Total time taken=322600 us.

Since the output is comma-separated, it can easily be parsed by a spreadsheet program.
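
The same output can also be parsed with a few lines of Python. This is only a minimal sketch: it assumes the output has been captured as text laid out exactly like the sample above, and the name parse_benchmark is made up for illustration (it is not part of the benchmark tool).

```python
import csv
import io

SAMPLE = """n, w, h, c, k, filter_dim, avg_time(us), max_time(us)
1, 3840, 2160, 1, 1, 3, 74383, 75142
1, 3840, 2160, 1, 1, 5, 88465, 88819
1, 3840, 2160, 1, 1, 9, 159752, 160324
Total time taken=322600 us.
"""

def parse_benchmark(text):
    """Return one dict per result row, keyed by the header names."""
    # Keep only comma-separated lines; this drops the "Total time" trailer.
    lines = [line for line in text.splitlines() if "," in line]
    reader = csv.DictReader(io.StringIO("\n".join(lines)),
                            skipinitialspace=True)
    return [{key: int(value) for key, value in row.items()} for row in reader]

rows = parse_benchmark(SAMPLE)
slowest = max(rows, key=lambda row: row["avg_time(us)"])
print(slowest["filter_dim"], slowest["avg_time(us)"])
```

From here it is easy to, say, compare the average time across filter sizes or flag configurations that miss a real-time budget.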

Check the code out here:
https://github.com/blacksoil/CUDNN-BenchmarkUtility

Thursday, October 22, 2015

Ubuntu 14.04 suddenly won't boot? This is how you debug it!



After a couple of days without rebooting my X201, which was running Ubuntu, I decided to reboot the system. Unfortunately, it wouldn't boot afterwards! After passing Ubuntu's familiar purplish-black blank background, it got stuck at a black screen. What scared me was that pressing caps-lock didn't even turn on the keyboard indicator, and I also couldn't enter "console mode" (ctrl + alt + f1). I was afraid that the system had gone into a hard hang, perhaps because I had accidentally installed some junk. However, it turned out that my xserver binary was missing; the file /usr/bin/X didn't exist, for some reason! How did I figure that out? This post explains it.

Ubuntu keeps several logs of the last boot attempt. These logs can be used to triage a boot issue, and are potentially useful for fixing it too. Through these logs, I was able to figure out my xserver issue. This is how you do it:
1. Boot the system and get it to the hanging state. This step is necessary in order to get the log files generated.
2. Boot into a live session using an Ubuntu CD or an Ubuntu USB stick.
3. Open "[path to partition]/var/log" on the partition where the problematic Ubuntu is installed. Note that this is not just "/var/log", because that belongs to the currently running live session.
4. Sort the log files by modified date, and open the logs that correspond to the boot attempt from step 1.
5. As there are several log files, I'd suggest looking at boot.log first, which is similar to the kernel log that you get from dmesg. If nothing is suspicious there, then it's likely to be a userspace issue.
6. In my case, I found my xserver problem through the log at /var/log/lightdm/lightdm.log. This is the suspicious snippet of the log:
[+0.05s] DEBUG: DisplayServer x-0: Logging to /var/log/lightdm/x-0.log
[+0.05s] DEBUG: DisplayServer x-0: Can't launch X server X -core, not found in path
[+0.05s] DEBUG: DisplayServer x-0: X server stopped

Through a quick Google search, I figured out that /usr/bin/X is where "X server X -core" is supposed to be located. In my case, the file was missing, so it was rather obvious. I then rebooted into Ubuntu recovery mode (by holding "shift" right after the system passes the BIOS screen), and reinstalled xserver via sudo apt-get install --reinstall xserver-xorg.
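
The "sort by modified date and look for something suspicious" part can also be scripted from the live session. This is just a rough sketch: the helper names and the list of keywords are my own picks, and "/mnt/var/log" is a placeholder for "[path to partition]/var/log". Some logs are only readable by root, so it may need to be run with sudo.

```python
import os

def recent_logs(log_dir, limit=10):
    """Return paths of regular files under log_dir, newest first."""
    paths = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):  # skips broken symlinks
                paths.append(path)
    paths.sort(key=os.path.getmtime, reverse=True)
    return paths[:limit]

def find_lines(path, needles=("not found", "error", "failed")):
    """Return the lines of a text log that contain any of the needles."""
    hits = []
    # errors="replace" keeps the scan going on compressed or binary logs.
    with open(path, errors="replace") as log:
        for line in log:
            lowered = line.lower()
            if any(needle in lowered for needle in needles):
                hits.append(line.rstrip("\n"))
    return hits

if __name__ == "__main__":
    # Placeholder mount point; replace with the real partition path.
    for path in recent_logs("/mnt/var/log"):
        for hit in find_lines(path):
            print(path + ": " + hit)
```

On my broken install, a scan like this would have surfaced the "not found in path" line from lightdm.log without opening each file by hand.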

Hope this post can be useful!

Monday, October 19, 2015

Qualcomm's thermal management file on Android devices.

While reading through the Ubuntu Touch source code, in particular the /device/lge/mako repo, which holds the configuration for the Nexus 4 (mako is the Nexus 4's codename), I came across the thermald-mako.conf file. It looks like the following:

sampling         5000

[tsens_tz_sensor10]

sampling         5000
thresholds       60      120
thresholds_clr   57      115
actions          none    shutdown
action_info      0       5000

[batt_therm]

sampling         5000
thresholds       360                       370                       380                       390                       410                       420                       450
thresholds_clr   350                       360                       370                       380                       400                       410                       440
actions          cpu+gpu+battery           cpu+gpu+battery           cpu+gpu+battery           cpu+gpu+battery           cpu+gpu+battery           cpu+gpu+battery           cpu+gpu+battery
action_info      1512000+400000000+240+0   1296000+325000000+215+0   1296000+325000000+192+0   1188000+200000000+181+1   1188000+200000000+181+1   1188000+200000000+181+2   1188000+200000000+181+3

From a quick Google search, I found out (from an xda-developers thread) that the file is used to control thermal throttling based on the sysfs nodes here:
cat /sys/devices/virtual/thermald/thermald_zone*/temp

The number after tsens_tz_sensor corresponds to the * in the path. The value read from that node is then compared against the thresholds.
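
To get a feel for the file's layout, here is a small parsing sketch. It only relies on what is visible in the file itself (section headers in square brackets, and a key followed by whitespace-separated values); the function name and the "global" bucket for settings that appear before the first section are my own inventions, and I'm not claiming this is how thermald itself parses or interprets the values.

```python
def parse_thermald_conf(text):
    """Parse the conf text into {section_name: {key: [values...]}}."""
    sections = {"global": {}}  # "global" holds keys before any [section]
    current = sections["global"]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        key, *values = line.split()
        current[key] = values
    return sections

SAMPLE = """\
sampling         5000

[tsens_tz_sensor10]

sampling         5000
thresholds       60      120
thresholds_clr   57      115
actions          none    shutdown
action_info      0       5000
"""

conf = parse_thermald_conf(SAMPLE)
sensor = conf["tsens_tz_sensor10"]
# Line up the columns: one (threshold, clear, action, info) tuple per level.
levels = list(zip(sensor["thresholds"], sensor["thresholds_clr"],
                  sensor["actions"], sensor["action_info"]))
for level in levels:
    print(level)
```

Reading the file column by column like this makes the structure clearer: each threshold has its own clear value, action, and action_info.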

How did I figure out that it's Qualcomm's? When I searched for thermald in the repo, the results showed that the thermald binary is a proprietary blob whose owner is listed as qcom:

aharijanto@aharijanto-ThinkPad-X201:~/droid/ubuntu/device/lge/mako$ g thermald
device.mk:60:   device/lge/mako/thermald-mako.conf:system/etc/thermald.conf
init.mako.rc:214:    # communicate with mpdecision and thermald
init.mako.rc:386:service thermald /system/bin/thermald
proprietary-blobs.txt:37:/system/bin/thermald
self-extractors/qcom/staging/device-partial.mk:37:    vendor/qcom/mako/proprietary/thermald:system/bin/thermald:qcom \
self-extractors/extract-lists.txt:42:            system/bin/thermald \
self-extractors/generate-packages.sh:104:            system/bin/thermald \
vendor_owner_info.txt:152:system/bin/thermald:qcom

Saturday, October 10, 2015

Convolutional Neural Net (CNN) vs. Feed-forward Neural Net vs. Human Visual-Processing System.

Recently, I was working on a digit-classification task. I trained a variant of the LeNet CNN on a modified MNIST dataset. The modification is the addition of non-digit samples, which are generated as follows:
1. For each image in the dataset, create four clones, shifted by 50% upward, downward, to the left, and to the right, respectively.
2. Each of the four images generated in (1) is translated again. An image that was translated upward/downward in step (1) is now shifted randomly (0-25%) to the right/left. Similarly, one that was translated to the left/right in (1) is now shifted upward/downward.

The network itself is modified to output one more class, which is used to classify a non-digit input. All the generated non-digit samples explained above are labeled with this class.
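
The generation of the non-digit samples can be sketched in plain Python as below. This is an illustration rather than the exact code I used: images are represented as lists of rows, pixels shifted past the border are assumed to be dropped, uncovered pixels are assumed to be filled with 0, and the function names are made up.

```python
import random

def shift(image, dx, dy, fill=0):
    """Return a copy of image moved dx pixels right and dy pixels down.

    Pixels shifted past the border are dropped; uncovered pixels get `fill`.
    """
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = image[y][x]
    return out

def make_non_digits(image, rng=random):
    """Generate the four non-digit samples for one digit image."""
    height, width = len(image), len(image[0])
    half_h, half_w = height // 2, width // 2
    samples = []
    # First shift: 50% up, down, left, right.
    for dx, dy in [(0, -half_h), (0, half_h), (-half_w, 0), (half_w, 0)]:
        # Second shift: random 0-25% along the other axis, either direction.
        if dx == 0:
            dx = rng.randint(-(width // 4), width // 4)
        else:
            dy = rng.randint(-(height // 4), height // 4)
        samples.append(shift(image, dx, dy))
    return samples
```

For a 28x28 MNIST image this moves the digit by 14 pixels along one axis and by up to 7 pixels along the other, so at most half of the digit remains visible.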

[Post the images that are falsely classified as digit here]

When used on an image of a properly aligned digit, the trained classifier worked pretty well. When it was used in a sliding-window fashion, though, the result was disappointing. There were still many false positives: it classified what obviously looked like non-digits as digits. I suspected the CNN architecture to be the culprit here. My thought is that since a convolution layer "combines" neighboring pixels together, multiple convolution layers followed by a fully-connected layer would "combine" more than just neighboring pixels, resulting in potentially unexpected connectivity between pixels, and hence unexpected activations.

After that, I was wondering whether a feed-forward neural net would produce better results when used in a sliding-window fashion. My thinking is that since its connectivity is more straightforward, its activation units would require more "exact" matches to trigger, in the same way that it is less robust to translations than a CNN, and hence it would be more robust to negative samples. When I get a chance, I'd really like to experiment with this. I'm thinking of using deep visualization techniques to analyse the robustness of a CNN vs. a feed-forward neural net against negative samples.

Also, it's worth mentioning that I came across End-to-End Text Recognition with Convolutional Neural Networks [Wang, et al], which talks about text recognition using CNNs. What's interesting is that they actually train a dedicated classifier, which is used in a sliding-window fashion, to detect whether a region contains centered text or not. Only after the region is determined do they run another character classifier that actually predicts the text there.

Another interesting finding is about how a deep neural network can be fooled. In their paper, Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, Nguyen, et al found that one can fool state-of-the-art pattern-recognition DNNs with a specially constructed image (http://www.evolvingai.org/fooling). Their paper shows how an image can be constructed such that a DNN-based classifier classifies it as a particular pattern with very high confidence. The image can be one in which a human would still be able to make out the pattern claimed by the classifier, or one that makes no sense at all to a human (a false positive).

The need to explicitly train neural networks against negative samples got me thinking about how our (human) visual processing system works. It seems that we can very easily make out a particular shape against a very complex background. I'm interested in understanding how our brain actually achieves this. With this understanding, we might be able to come up with a better architecture to tackle the recognition problem, as well as to simplify the tedious process of gathering the training data. While gathering enough positive samples - in order to train a classifier that works reasonably well - is already tedious work, gathering negative samples is even more challenging. If the size of the set of positive samples in the universe of all samples is X, the size of the negative counterpart is the size of the entire universe minus X. Imagine how large it is!


Friday, October 9, 2015

Active Learning as a Way to Aid Object Labeling.

I recently got my very first introduction to Active Learning, through reading Active Learning for Visual Object Recognition [Freund, et al]. At its core, this paper talks about Active Learning as a way to aid the tedious process of labeling objects within an image (for visual recognition purposes). Specifically, the paper talks about pedestrian detection, where the input dataset comes from a video stream recorded by a camera attached to a car being driven. Input images are then extracted from the video stream's frames.

The labeling part is clever. Instead of manually specifying all the regions where the pedestrians are located, the idea is to use a classifier, which at the same time is being trained, to help with the process. Here's the high-level idea:
1. Split the input images into a number of sets.
2. Pick one of the sets and manually label the regions with pedestrians.
3. Train a pedestrian classifier using the labeled set.
4. Pick a different set of unlabeled input images, then label them using the previously trained classifier.
5. Re-train the classifier with all the sets of labeled images.
6. Repeat steps (4) and (5) until all sets are labeled.
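
The loop above can be written down as a short sketch. None of this comes from the paper's code: the function name and all four callables are hypothetical stand-ins for the manual labeling tool, the training routine, the detector, and the human check of the classifier's proposals.

```python
def active_labeling(image_sets, manual_label, train, propose, verify):
    """Label every set, using the classifier-in-training to propose regions.

    All four callables are hypothetical stand-ins:
      manual_label(images) -> labels drawn by hand (slow)
      train(labeled)       -> classifier fitted on (images, labels) pairs
      propose(clf, images) -> regions the classifier thinks are pedestrians
      verify(proposals)    -> labels after a human accepts/rejects each region
    """
    first, *rest = image_sets
    # Bootstrap: the first set is labeled entirely by hand.
    labeled = [(first, manual_label(first))]
    classifier = train(labeled)
    for images in rest:
        # The classifier labels the next set; a human only checks its output.
        proposals = propose(classifier, images)
        labeled.append((images, verify(proposals)))
        # Re-train on everything labeled so far.
        classifier = train(labeled)
    return classifier, labeled
```

The expensive manual_label call happens only once, for the first set; every later set goes through the much cheaper verify step.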

The key idea here is that manually specifying the region of a pedestrian (e.g. using a mouse to carefully draw a bounding box) is significantly more time-consuming (~20 seconds) than marking whether a region proposed by the classifier is a true positive or a false positive (~3 seconds). And as the classifier undergoes more training, its accuracy also increases, which in turn reduces the number of false positives, and ultimately speeds up the labeling process. Interesting, isn't it? :)

Wednesday, October 7, 2015

BlimpBot

Appliance Reader


Background

What is Appliance Reader?
As a blind person, Jane relies on braille stickers to use her appliances: oven, stove, microwave, and laundry machines. She puts the stickers on each of the appliance's buttons to identify its functionality. Furthermore, since she can't use the stickers to get the appliance's feedback, such as how many more minutes until the cookies are baked or until her laundry is done, she has to specifically buy appliances with physical feedback, such as a laundry machine that is controlled by a dial-based timer. However, this type of appliance is becoming obsolete and hard to find, as digital display-based ones are becoming more ubiquitous. The Appliance Reader project aims to help people like Jane read digital displays.

What device does Appliance Reader run on?
People with visual impairments often rely on very specialized hardware to do certain tasks. For example, gadgets like BrailleNote [1] are commonly used to take notes because they provide convenient and accessible ways of typing and reading. However, such specialized gadgets are often very expensive due to their relatively small niche market. At the time of this writing, a BrailleNote costs up to a few thousand dollars. On the other hand, smartphones, which are much more affordable due to being mass-produced, are becoming a popular substitute for such specialized gadgets; features such as text-to-speech, voice-over, and optical character recognition have made smartphones accessible to people with visual impairments. Appliance Reader is a smartphone application that uses the phone's camera to read an appliance's digital display. The user takes pictures or video streams of the appliance's digital display, and the phone interprets and speaks out the reading.

Who worked on this project?
This project first started at Prof. Richard Ladner's Mobile Accessibility lab, which is one of the research groups at the University of Washington Computer Science and Engineering department. The people who originally worked on this project were Prof. Richard Ladner (CSE Professor at UW), Bryan Rusell (Computer Vision Researcher at UW), Antonius Denny Harijanto (ex-undergrad student), Michael Hotan (ex-undergrad student). During our undergraduate years, Mike and I started to work on this project from scratch, with close guidance from Prof. Ladner and Bryan. Bryan taught us a lot of computer vision concepts that were necessary for this.

We worked on this from 2012 until Mike and I graduated in 2013. Although we made significant progress in turning what was initially only an idea into partially working implementations, we weren't able to fully finish the project by the time we graduated. The machine learning portion of this project - for the interpretation phase - was the only significant piece left to be completed.

As an effort to study machine learning, I continued working on this project during my spare time in 2015. As I feel that the project is now very close to completion, I'm writing this post to share what I've learned. I'm hoping that this post will be beneficial! :)


Challenges

First thought.
When we first started the project, we thought that the problem was pretty straightforward: read, interpret, and then present. More concretely, we were thinking of reading an appliance's digital display by running a publicly available OCR engine over the pictures taken with the camera, interpreting the results as text using appliance-specific rules, and then reading it out loud using a text-to-speech engine. However, we found the problem to be pretty complex, so we decided to create a system that uses computer vision and machine learning to solve it.

What's the problem with OCR?
After we experimented with Tesseract, which was considered to be one of the most accurate publicly available OCR engines at the time, we realized that OCR would not work well for our purposes. From our research and experiments, we came up with several hypotheses that explain this.
First, an OCR engine is generally designed to read scanned text documents, so it makes use of common characteristics of such documents: text alignment, uniformity in background color (methods used by top scorer of ICDAR 2015 competition [3]), and uniformity in font sizes across the document. As an example, [X from paper Y][1] proposes a method to retrieve the bounding boxes of text regions by sweeping a horizontal line downward through a document and measuring the density of the pixels that the line crosses at each position. These densities are then used to predict the locations of the text. This heuristic sounds very reasonable for documents with lots of properly aligned text, but not for appliances' digital displays, which usually consist of only a handful of characters.
Second, appliances' digital displays often use uncommon fonts and non-alphanumeric symbols. Since an OCR engine would need additional training to support these anyway, designing our own trainable system would give us more flexibility and control.
Third, pictures of the digital displays that are captured by people with visual impairments are likely to be skewed, since it might be hard for them to angle the camera perfectly; OCR doesn't work well on skewed pictures, since it is designed to interpret documents with good perspective (e.g. scanned documents).
Fourth, since the pictures are taken with smartphone cameras, they may have varying lighting and may suffer from reflections due to the displays' glossy surfaces. Once again, these problems are not faced by the kind of documents that OCR is usually run on.
In our experiments, even on carefully taken pictures that were taken perpendicular to the display's plane, had good lighting, and had no glare, it was still hard to get good OCR results.


Here are some appliance pictures that illustrate the issues described:


Notice how bad the reflections are, and that "Energy Saver" and "Cubed" are indicated by unusual symbols that an off-the-shelf OCR engine wouldn't recognize.


Notice the rotation on the picture and how the font sizes are not uniform.

[TODO: Put link to OCR experiment]


High-Level Concepts of Our System

Constraints
After we decided that OCR was not a good choice for our purpose, we started designing our own system that uses computer vision and machine learning. We started by deciding on the system's constraints, which are the result of considering the specification of Appliance Reader: the hardware that the application runs on, the desired user interface, and the general characteristics of appliances' digital displays. These are the constraints we came up with:

1. Lightweight algorithms.
As explained earlier, in order to make Appliance Reader cost-effective, we wanted it to run on smartphones. At the time of the initial research, in 2012, smartphones were pretty limited in terms of processing power; therefore, the algorithms to be used couldn't be so heavy that they wouldn't run on smartphones.

2. Work on different appliances.
One of the goals of Appliance Reader was to support as many appliances as possible. This meant that the system needed to be extensible; adding support for a new appliance should not require too much effort.

3. User should not need to modify their appliances.
Complex modifications such as to an appliance's electrical hardware should be avoided. 

User-Interface Design
Based on the constraints above, we came up with the following flow of a smartphone application:

1. Selecting an appliance
When the app is first opened, it shows a list of supported appliances. The user navigates through the appliances by sliding a finger over the list. As the user swipes, the voice-over system speaks out the appliance that the finger is touching. To confirm a selection, the user double-taps the screen.

2. Taking a picture
After an appliance is selected, the application goes into the picture-taking mode, where the camera stream is opened. Depending on how powerful the device running the application is, there are two possible user interfaces. The first, which is intended for less powerful devices, requires the user to tap on the screen when he/she thinks that the appliance's digital display is within the camera's field of view. The second, which is for more powerful devices, doesn't require the user to tap on the screen; he/she only needs to point the camera at the appliance's digital display. The second approach is computationally heavier because pictures need to be taken and processed continually. In this post, we use the first approach.

3. Processing the picture
After the user taps on the screen, Appliance Reader processes the picture in the background. Meanwhile, the user is asked to wait while a progress bar is shown on the screen.

4. Interpreting the result
If an interpretation can be made out of the picture taken in (2), the result is verbalized to the user using a text-to-speech engine. If, however, the interpretation fails, for example due to the picture being too blurry or the digital display being cropped off, the user is asked to take the picture again, so we go back to (2).

The General Idea of Our System
The hardest part of the computer vision and machine learning aspects of this project is in the details. The main concepts are fairly simple, but the practice is pretty challenging, as there are many unexpected problems along the way. Instead of describing all the nitty-gritty details immediately, I'll first illustrate only the high-level concepts. The implementation details and challenges will be explained in the next section.

Reference Image
After the high-level design of the user interface was decided, we came up with the main idea of this project, which is to have a reference image for each of the supported appliances. A reference image is an annotated picture of an appliance's digital display in which the locations and semantics of each of the components in the display are known. To illustrate this concept, refer to the following figure. The picture of the thermostat, without any of the annotations - the texts and boxes - is the reference image. The Cartesian coordinates of the four boxes that surround the temperature and humidity are also known. And the semantics of the display is that the temperature can be interpreted from the two boxes on the left while the humidity is from the two boxes on the right.




So how is a reference image used? A picture of the appliance that the user takes is transformed to look like the reference image, as shown in the following figure. A homography is used for this transformation. The homography matrix is a 3x3 matrix which, when applied to the user-taken image, transforms it as if it had been taken from the reference image's perspective.


The purpose of this transformation is to remove the "skew" from the picture that is taken by our visually-impaired user. Without this, things like translation, rotation, scaling, and warping would have to be accommodated when interpreting the digits on the display. On the other hand, by applying the transformation, the locations of the boxes that surround the temperature and humidity digits are immediately known, which reduces the complexity of interpreting the display.
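
To make "applying a 3x3 matrix" a bit more concrete, here is a minimal sketch of how a single pixel location is mapped through a homography matrix. The function name and the example matrix are made up for illustration; on a real image the same mapping is applied to every pixel, which is what OpenCV's warpPerspective function does for you.

```python
def apply_homography(H, x, y):
    """Map pixel (x, y) through the 3x3 matrix H (a list of three rows)."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Dividing by w is what makes this a projective (not affine) mapping.
    return xh / w, yh / w

# A made-up matrix that scales by 2 and moves everything by (10, 20).
H = [[2, 0, 10],
     [0, 2, 20],
     [0, 0, 1]]
print(apply_homography(H, 3, 4))
```

When the last row is (0, 0, 1), as in this example, w is always 1 and the mapping is a plain affine one; a skewed photo needs nonzero values in the first two entries of that row.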

With the information provided from the reference image, we have sufficient information needed to interpret the picture taken. First, we crop out the individual regions that make up the digits of the temperature and humidity. After that, we interpret each of the regions using machine learning. And finally, we verbalize the result of interpretation and speak it out to the user using text-to-speech engine.





Technical Details

In this section, I'm going to provide more of the theory, details, and challenges that we came across when we worked on the project. Although the project is conceptually pretty straightforward, the implementation wasn't as direct. Most of the challenges came from the "uncertain nature" of computer vision and machine learning problems. More concretely, they were about getting the computer vision and machine learning algorithms to work robustly given the constraints that we had. For these, we had to spend significant effort trying to understand the root causes of the problems, which frequently required us to write separate programs to visually inspect them.

The technologies used are Android smartphones, the OpenCV computer vision library, and deep learning with the Caffe framework. Android was used because it is the mobile programming platform that we were the most familiar with. And as computer vision is a pretty complex field by itself, we used OpenCV, which was one of the most popular open-source computer vision libraries, so that we wouldn't have to implement many of the algorithms on our own. Furthermore, OpenCV was also supported on different platforms such as Windows, Mac, Linux, Android, and iPhone. Finally, deep learning was used as it had recently been found to work really well on computer vision problems similar to what Appliance Reader was trying to solve. Deep learning bested the state-of-the-art approaches on problems such as image recognition.

Instead of immediately prototyping on an Android phone, we mostly worked on a Unix-based desktop. The thought was that debugging on a desktop is a lot more convenient than on a phone. And to ensure that the program would run on the phone, we occasionally ported our progress onto it. For example, this one is a program that does homography transformation on Mac OS X, while this one is the Android port of it. The prototyping was done in different ways. The real-time way was to have an Android phone capture the picture of an appliance, which was then immediately sent to the desktop for processing via a TCP connection. The offline way was to capture a number of pictures with the phone - simulating how real users would use it - and then keep using those pictures for different experiments on the desktop.

Homography transformation
To transform the user-taken image to look like the reference image, we use homography transformation, which is also known as projective transformation. Homography transformation is done by applying a particular 3x3 matrix to the user-taken image. When compared to the well-known affine transformation, homography has two more degrees of freedom, hence it has 8 degrees of freedom in total. Since homography transformation has been successfully used in 3D reconstruction problems, such as reconstructing 3D model from 2D images and projecting 3D points onto 2D space, this transformation is naturally a good candidate to solve our problem.
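
The degrees-of-freedom count can be seen by writing the transformation out. This is the standard textbook form rather than anything specific to our project, and the symbol names are my own choice: a point (x, y) in the user-taken image maps to (u, v) in the reference image via

```latex
\begin{pmatrix} x' \\ y' \\ w' \end{pmatrix}
=
\begin{pmatrix}
h_{11} & h_{12} & h_{13} \\
h_{21} & h_{22} & h_{23} \\
h_{31} & h_{32} & h_{33}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix},
\qquad
(u, v) = \left( \frac{x'}{w'}, \frac{y'}{w'} \right)
```

Multiplying all nine entries by the same nonzero constant leaves (u, v) unchanged, so only 9 - 1 = 8 of them are free. An affine transformation is the special case where h_{31} = h_{32} = 0 and h_{33} = 1, which leaves 6 free entries; the two extra entries h_{31} and h_{32} are what allow a homography to model perspective.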

This transformation is used to reduce the complexity of the interpretation (machine learning) problem, in both of the training and classification phases. To illustrate this, consider taking a picture of an appliance and then using it for interpretation. The size of the digital display on the picture that is just taken would depend on the distance of the camera from the appliance. When the distance is shorter the appliance would appear bigger, and so would the symbols on the display that need to be interpreted by using machine learning. Without the transformation the classification needs to be done across different scales, which therefore increases the complexity of the interpretation. In addition to that, the complexity needed for training the machine learning model would also increase as the model needs to be able to deal with more varieties of negative samples from different scales. Moreover, other factors such as rotation, translation, and skew also need to be considered similarly.

1. Correspondences
Given a picture of an appliance that is to be transformed, we first need to compute its correspondences with the reference image. Correspondences are how the points on the picture map onto the reference image, so this is essentially a 2D -> 2D mapping. These correspondences are then used to compute the aforementioned homography matrix. Math-wise, only four correspondences are needed to compute the matrix.
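
Here is a sketch of why four correspondences are enough: each correspondence gives two linear equations, and a homography has eight unknowns once its bottom-right entry is fixed to 1. The code below is a plain-Python illustration with made-up function names, not what we actually ran (we relied on OpenCV); fixing that entry to 1 is an assumption that fails in the rare case where the true value is 0.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate correspondences")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def homography_from_4(src, dst):
    """Return the 3x3 matrix mapping the four src points onto the dst points."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), same for v.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]

def project(H, x, y):
    """Map (x, y) through H, including the perspective division."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

# Corners of a unit square in the photo, and where they should land.
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(0, 0), (2, 0), (3, 3), (0, 2)]
H = homography_from_4(src, dst)
print([project(H, x, y) for x, y in src])
```

The printed points should match dst up to floating-point error. With more than four correspondences the system becomes over-determined, which is where a robust estimator comes in.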

2. SIFT
While we can compute correspondences by hand fairly easily, how do we compute them programmatically? We computed SIFT features for both the picture and the reference image, and used the RANSAC estimation algorithm to approximate correspondences from those features. Features are basically points that are considered the "key features" of an image. A common example is a corner, and to illustrate why a corner is interesting, consider a triangle and a rectangle. A triangle is uniquely characterized by its three corners, and a rectangle similarly by its four. SIFT features have been shown to work very well, mainly because they are invariant to translation, rotation, and scaling, and partially invariant to geometric distortion and changes in illumination. This means that roughly the same features (to a certain confidence) would be extracted from versions of an image to which combinations of those transformations have been applied. As for Appliance Reader, the reference image and the user-taken image are both of the same appliance, so the expectation is that the SIFT extraction algorithm finds the same features in both.

3. RANSAC
Now that we have features computed from the reference image and the user-taken image, we need to find the matches between the two sets of features. For this, we used OpenCV's "knnMatch", which finds the k best matches for each feature. These matches are not reliable on their own, because they are chosen only by the distance between feature descriptors. RANSAC is then used to compute the correspondences. It repeatedly picks a small random subset of the matches (four are enough for a homography), computes a homography matrix from them, and tests how accurate that matrix is, in order to find a good one. How good a matrix is, is determined by the number of matches that agree with the transformation: the more matches agree, the better the matrix.
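The scoring step of that loop can be sketched as follows. This is only the "how many matches agree" part, not a full RANSAC implementation. The types, the function name countInliers, and the pixel threshold are assumptions for illustration, not code from Appliance Reader.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper types for this sketch.
struct Point2 {
    double x;
    double y;
};

// A candidate match: a point on the picture and its claimed
// counterpart on the reference image.
struct Match {
    Point2 src;
    Point2 dst;
};

// Row-major 3x3 homography: H[3 * row + col].
using Mat3 = std::array<double, 9>;

// Counts the matches that agree with H, i.e. whose source point lands
// within maxErrorPx pixels of its claimed destination after mapping.
std::size_t countInliers(const Mat3& H, const std::vector<Match>& matches,
                         double maxErrorPx) {
    assert(maxErrorPx >= 0.0);
    std::size_t inliers = 0;
    for (const Match& m : matches) {
        const double w = H[6] * m.src.x + H[7] * m.src.y + H[8];
        if (std::fabs(w) < 1e-12) {
            continue;  // Mapped to infinity, so it cannot agree.
        }
        const double px = (H[0] * m.src.x + H[1] * m.src.y + H[2]) / w;
        const double py = (H[3] * m.src.x + H[4] * m.src.y + H[5]) / w;
        if (std::hypot(px - m.dst.x, py - m.dst.y) <= maxErrorPx) {
            ++inliers;
        }
    }
    return inliers;
}
```

A RANSAC loop would call something like this once per candidate matrix and keep the matrix with the highest count.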

Challenges:
a. Some appliances were so simple in appearance that too few features were computed to get a good homography result.

To solve this problem, we put "interesting" stickers on the appliances. For example, the tiger and elephant stickers on the following appliance:



b. Certain appliances/backgrounds produce too many "similar" features, making the combination of feature matching ("knnMatch" mentioned before) and RANSAC estimation ineffective.

Since the RANSAC step relies on the assumption that the feature matching produces enough "good" matches, the pipeline didn't work well on appliances/backgrounds that have too many "similar" features. Some examples are carpets and speakers that have lots of small holes.

To alleviate this problem, we ended up using one of the tricks mentioned in David Lowe's original paper on SIFT. The trick is to compute the two best matches for each feature instead of just one (hence we used "knnMatch"), and to pass the best match on to the subsequent RANSAC step only if it is clearly better than the second-best. The key intuition is that matches that are not "unique" enough are not good matches.

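void CVHelper::featureMatch(const Mat& queryDesc, const Mat& trainDesc, vector<DMatch> &matches) {
    BFMatcher matcher;
    vector<vector<DMatch> > tempMatches;
    // For each query descriptor, find its two nearest train descriptors.
    matcher.knnMatch(queryDesc, trainDesc, tempMatches, 2);
    const float ratio = 0.8f; // As in Lowe's paper; can be tuned
    for (size_t i = 0; i < tempMatches.size(); ++i) {
        // Skip queries that got fewer than two candidates.
        if (tempMatches[i].size() < 2) continue;
        // Keep the best match only if it is clearly better than the runner-up.
        if (tempMatches[i][0].distance < ratio * tempMatches[i][1].distance)
        {
            matches.push_back(tempMatches[i][0]);
        }
    }
}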


c. How to check that the homography transformation was successful?

After applying the tricks above, we had a pretty decent success rate at transforming the user-taken picture to look like the reference image. Nevertheless, the transformation still sometimes failed, for example because the picture was too blurry (e.g. the phone shook while taking it) or the illumination was really bad. So the question is, how do we know when we need to take another picture?

One idea is to use the inverse transformation. After the user-taken picture is transformed, we can try to transform it back to look like the original, and programmatically check whether it is close enough to the original.

Interpretation



TODO: finish this....

Droid Synergy

Tuesday, October 6, 2015

Samsung SSD 850 EVO on ThinkPad X201

I had been contemplating buying a new laptop to replace my 5-year-old ThinkPad X201. My criteria were:
1. It has to run Linux (preferably Ubuntu, as I'm most familiar with it) well, like really well, as it'll be my main OS.
2. Small (13" would be the largest)
3. Perhaps with an NVIDIA GPU, so I can run CUDA stuff on it (I've been into machine learning these days) and do some casual gaming.

By focusing on criteria (1) and (2), I came across the Dell XPS 13 Developer Edition, which is a laptop that ships with Ubuntu. Because it didn't come with Windows, guess what? It actually cost less than the non-developer options that came with Windows! And the best part is that Ubuntu is optimized to run well on the laptop. Tempting. Very tempting. As finding a decent laptop to run Ubuntu had indeed been challenging, I drooled over the one with a 256GB SSD and an i7 CPU. But still I didn't buy it.

Thankfully, it didn't have an NVIDIA GPU! If it had, I'd have already bought it. So what's the deal? Well, I realized that my combo of X201 with Ubuntu had been serving me really well. There had almost been no time I needed more performance than what it offered (except, of course, for casual gaming, which I rarely do, and I already have dedicated gaming devices anyway). So I thought: if it only had Intel's integrated GPU, how would it serve me better than my X201? Not much, really. Moreover, I already had a docking station for my X201! If I were to buy a new laptop, for sure I'd want to get a dock for it, and that'd be like $100+ extra. Bummer..


Long story short, instead of abandoning it, I decided to give my ThinkPad X201 some love. I treated it to a 500GB Samsung SSD 850 EVO. How did it go? Really sweet! With Ubuntu 14.04 (w/o any form of disk encryption enabled), it booted in 12 seconds! Going into and resuming from hibernate was almost instantaneous! So far I've been loving it so much :)

Profile-based CPU controls to save power / improve perf

To save more power and hence improve battery life, the CPU governor as well as clock speeds can be controlled more finely on Ubuntu using cpufreqd. To illustrate how this can be useful, my ThinkPad X201 (running Ubuntu 14.04) always uses the performance governor, even when the laptop isn't plugged into a wall charger. What a waste of power! cpufreqd allows different profiles to be set, depending on the charging status or battery level. In each profile, the CPU governor as well as the min/max CPU clock can be customized. Moreover, there's also indicator-cpufreq, which provides a nice status bar icon indicating the current CPU configuration (both governor and speed).



Here's how to get them up and running:
1. sudo apt-get install indicator-cpufreq cpufreqd cpufrequtils
2. sudo vim /etc/cpufreqd.conf # This is the file to configure the profile

Below is a snippet of what I modified (the rest stays default):

[Profile]
name=Performance High
minfreq=70%
maxfreq=100%
policy=performance
#exec_post=echo 8 > /proc/acpi/sony/brightness
[/Profile]

[Profile]
name=Performance Low
minfreq=60%
maxfreq=80%
policy=performance
[/Profile]
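Profiles like these are only selected when a [Rule] section points at them. The stock /etc/cpufreqd.conf ships with rules of roughly the following shape. This is a sketch under that assumption, with illustrative rule names and battery range, so compare it against the rules already in your file rather than pasting it in as-is.

```
[Rule]
name=AC Rule
ac=on
profile=Performance High
[/Rule]

[Rule]
name=AC Off - High Power
ac=off
battery_interval=70-100
profile=Performance Low
[/Rule]
```

The profile= value has to match the name= of a [Profile] section exactly.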


Furthermore, the following are useful for debugging:
1. cpufreq-info. This can be used to see which governor and clock speed are currently in effect. It is also useful for checking whether a configuration selected via indicator-cpufreq (e.g. manually choosing a different governor) has been applied. In the past, I've had an issue where my CPU clock was locked to 1.2 GHz (this was back on Ubuntu 12.04; for further detail, see my other blog post)


2. 'cat /var/log/syslog', as cpufreqd logs are written onto this.

3. [Added 10/21/2015] If for some reason the rules written in /etc/cpufreqd.conf don't get applied, check whether the cpufreqd daemon is actually running. If it's not, you can start it manually with 'sudo /etc/init.d/cpufreqd start'. On my ThinkPad X201, without cpufreqd running, the max frequency is somehow capped in a weird manner.

Sunday, October 4, 2015

Simple Key Re-Mapping on Ubuntu 14.04

This tutorial shows how a simple key remapping can be done using XKB, which is the standard keyboard mapper that Ubuntu 14.04 uses.

To give a little background, I have a ThinkPad X201. Around the laptop's arrow keys, there are two keys that act as BACK and FORWARD (in the Internet-browser sense), which I found really redundant since I always use "ALT+LEFT" or "ALT+RIGHT" to achieve the same goal. And so I decided to remap these keys to "PG UP" and "PG DOWN" respectively.



As I researched how to achieve this, I found that XKB is pretty complex. For example, it allows remapping a key to different keys depending on the modifiers (e.g. A becomes B, while Alt+A becomes Alt+C). While what I wanted was very simple, most of the resources I found online were pages long, trying to explain the quirks of XKB. Now that I've learned how to achieve my remapping, I'm doing a brain-dump here :)

Goal: To map "BACK" and "FORWARD" to "Pg Up" and "Pg Down"
Steps:
  1. Figure out the keycodes for "BACK", "FORWARD", "Pg Up", and "Pg Down"
    • The codes can be found at /usr/share/X11/xkb/keycodes/evdev
    • They are: I166, I167, PGUP, and PGDN
  2. Figure out how the keycodes are used
    • Go to /usr/share/X11/xkb/
    • Grep for the keys above:
      • I166 and I167 are used by /usr/share/X11/xkb/symbols/inet, by default mapped to "XF86Back" and "XF86Forward"
      • PGUP and PGDN are used by many files. One of them is /usr/share/X11/xkb/symbols/pc, which shows that they map to "Prior" and "Next"
  3. Customize the keycodes' mapping:
    • In accordance with the finding in (2), change "XF86Back" and "XF86Forward" that are used by I166 and I167 to "Prior" and "Next"
  4. Erase XKB cache files
    • Remove all *.xkm files under /var/lib/xkb/
  5. Reboot
  6. Enjoy!
Here is the git diff for the above:

aharijanto@aharijanto-ThinkPad-X201:/usr/share/X11/xkb/symbols$ git diff
diff --git a/symbols/inet b/symbols/inet
index 5c4784e..8e522aa 100644
--- a/symbols/inet
+++ b/symbols/inet
@@ -144,8 +144,10 @@ xkb_symbols "evdev" {
     key <I163>   {      [ XF86Mail              ]       };
     key <I164>   {      [ XF86Favorites         ]       };
     key <I165>   {      [ XF86MyComputer        ]       };
-    key <I166>   {      [ XF86Back              ]       };
-    key <I167>   {      [ XF86Forward           ]       };
+//  key <I166>   {      [ XF86Back              ]       };
+    key <I166>   {      [ Prior                 ]       };
+//  key <I167>   {      [ XF86Forward           ]       };
+    key <I167>   {      [ Next                  ]       };
 //  key <I168>   {      [ ]       }; // KEY_CLOSECD (opposite of eject)
     key <I169>   {      [ XF86Eject             ]       };
     key <I170>   {      [ XF86Eject, XF86Eject  ]       };