Furia-chan is a "Google" for binary programs. It can be used to streamline the process of detecting Libre/Open Source software license violations.


Furia-chan works in a way analogous to a web information retrieval program (Google, Yahoo). We take a binary program, and we cut it into pieces that we call fragments (for information retrieval programs these are words). In reality these fragments are trees of machine instructions generated from data flow analysis.

In the same way Google and Yahoo use stemming to "normalize" words to their roots, we use a tree distance function to be able to match slightly modified trees. By using this function we can normalize a program to its "base fragments". Once the program is normalized, we use standard information retrieval scoring techniques to match our binary programs, just as if we were matching a web page.

You can read some papers that talk about Furia-chan. I have also included the presentation slides given at the respective conferences:

A Tree Distance Function Based on Multi-sets. PAKDD 2008 (to appear in Springer's LNCS) Paper Presentation
Fast Approximate Matching of Programs for Protecting Libre/Open Source Software By Using Spatial Indexes. SCAM 2007 Paper Presentation
On Approximate Matching of Programs for Protecting Libre Software. Cascon 2006 Paper Presentation