Sunday, August 28, 2011

Trip Report: KDD 2011

The SIGKDD 2011 conference took place August 21 - 24 at the Manchester Grand Hyatt in San Diego, CA.  Researchers from all over the world interested in knowledge discovery and data mining were in attendance.  This conference in particular has a heavy statistical analysis flavor, and many presentations were math intensive.

I was invited to present my master's project research at the Mining Data Semantics (MDS2011) Workshop of KDD.  In this paper, we present an approach to find the social media profiles of people from an organization.  This is possible due to the links created between members of an organization.  For instance, co-workers or students will likely friend each other, creating hyperlinks between their respective accounts.  These links, if public, can be mined and used to disambiguate profiles that may share the same names as the individuals we are searching for.  The following figure shows the number of profiles found from the ODU Computer Science student body for each respective social media site and the links found between them.



This picture represents the actual students themselves and the links between them.  Black nodes are undergrads, green nodes are grads, and red nodes are members of the WS-DL research group.
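At its core, the discovery step is a graph problem: candidate profiles that share a name can be disambiguated by how well they link into the set of profiles already attributed to the organization.  Here is a minimal sketch of that idea in Python using networkx; the profile names, the seed set, and the scoring function are hypothetical illustrations, not the actual implementation from the paper.

    import networkx as nx

    # Hypothetical profile graph: nodes are social media accounts,
    # edges are public friend links mined from the sites.
    G = nx.Graph()
    G.add_edges_from([
        ("fb:alice.smith", "tw:bob_jones"),
        ("fb:alice.smith", "li:carol-wu"),
        ("tw:bob_jones", "li:carol-wu"),
        ("tw:alice_smith_2", "tw:random_user"),  # same name, different person
    ])

    # Seed profiles already confirmed to belong to the organization.
    confirmed = {"tw:bob_jones", "li:carol-wu"}

    def org_link_score(graph, candidate, seeds):
        """Fraction of a candidate's links that point into the confirmed set."""
        neighbors = set(graph.neighbors(candidate))
        return len(neighbors & seeds) / len(neighbors) if neighbors else 0.0

    # Two candidates share the name "Alice Smith"; the one embedded in the
    # organization's link structure scores higher and is kept.
    for cand in ["fb:alice.smith", "tw:alice_smith_2"]:
        print(cand, org_link_score(G, cand, confirmed))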


These are the slides:
An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles

Here is the paper:
MDS 2011 Paper: An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles


I've summarized some of the interesting presentations from the conference:

Stephen Boyd - Stanford University "From Embedded Real-Time to Large-Scale Distributed".  Stephen Boyd's talk focused on his current research area of convex optimization.  He explained that convex optimization is a mathematical technique in which many complex problems of model fitting, resource allocation, engineering design, etc. can be transformed into a simple convex optimization problem, solved, and then transformed back into the original problem domain to get the solution.  He went on to explain how this can be applied in settings from real-time embedded systems, such as a hard disk drive head seek problem, to large distributed systems such as California's power grid.
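As a toy illustration of the pattern Boyd described (pose the problem in convex form, solve it, and map the answer back), here is a sketch in Python using scipy; the resource-allocation setup is an invented example, not one from the talk.

    import numpy as np
    from scipy.optimize import minimize

    # Toy convex problem: split a fixed budget across 3 machines to
    # minimize total quadratic cost.  Convexity guarantees the solver's
    # local optimum is the global one.
    cost_weights = np.array([1.0, 2.0, 4.0])

    def total_cost(x):
        return float(np.sum(cost_weights * x ** 2))

    budget = {"type": "eq", "fun": lambda x: x.sum() - 1.0}
    bounds = [(0.0, None)] * 3  # allocations must be non-negative

    result = minimize(total_cost, x0=np.ones(3) / 3,
                      bounds=bounds, constraints=[budget])
    print(result.x)  # optimal split, inversely proportional to the weights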

Amol Ghoting - IBM "NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce".  With Hadoop, you write a map function and a reduce function, mapping anything to (key, value) pairs.  The problem with Hadoop is that it has a two-stage data flow which can be cumbersome to program against.  Also, job scheduling and data management are handled by the user.  Lastly, code reuse and portability are diminished.  This toolkit tries to make the key features of Hadoop available to developers but without a Hadoop-specific implementation.  NIMBLE decouples algorithm computation from data management, parallel communications, and control.  It does this using a series of basic datasets and basic tasks that form a DAG, and tasks can spawn other tasks.  With this structure in place, simultaneous data and task parallelism is achievable.
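For context on the two-stage data flow NIMBLE abstracts away, here is a minimal word-count sketch in plain Python; real Hadoop would shard the map and reduce phases across machines, but the shape of the computation is the same.

    from collections import defaultdict

    documents = ["data mining on mapreduce", "parallel data mining"]

    # Map phase: emit (key, value) pairs.
    pairs = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)

    # Reduce phase: aggregate each key's values.
    counts = {word: sum(values) for word, values in grouped.items()}
    print(counts)  # {'data': 2, 'mining': 2, ...}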

David Haussler - UC Santa Cruz "Cancer Genomics".  The cost of DNA sequencing has dropped dramatically.  It was tracking Moore's law, but cost is now falling 10-fold every two years, so entire genomes can now be sequenced cheaply.  Created the Cancer Genome Atlas, through which 10,000 tumors will be sequenced in the next two years.  Cancer genome sequencing will soon be standard clinical practice.  Because each person's DNA is different, and each tumor arising from a person's DNA is different, a huge computational processing problem looms in the near future.


Ahmed Metwally - Google. "Estimating the Number of People Behind an IP Address".  Most research assumes that one IP address means one person, but this is not the case.  The number of users behind an IP also changes over time; for instance, a hotel hosting a conference will have many more users sharing the same IP address than usual.  So, how would one estimate the number of these users in a non-intrusive way?  One method is to look at trusted cookie counts; another is to look at traffic diversity.  Google caps traffic volume per IP to stop people from gaming the system using the same IP address.  Google knows how many users share an IP address when they are logged in to Google's sites with a username and password.  However, some of Google's traffic is from users that don't have a Google account.  This research is for those who want to filter users without asking them for any identification, thus preserving their privacy.  This method is currently being used at Google for detecting click fraud.
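A crude version of the trusted-cookie-count idea can be sketched in a few lines of Python; the log format and the notion of "trusted" below are assumptions for illustration, not Google's actual pipeline.

    from collections import defaultdict

    # Hypothetical request log: (ip, cookie_id, cookie_is_trusted).
    log = [
        ("10.0.0.1", "c1", True),
        ("10.0.0.1", "c2", True),
        ("10.0.0.1", "c2", True),   # repeat visit, same cookie
        ("10.0.0.2", "c3", False),  # untrusted (e.g., freshly minted) cookie
    ]

    trusted_cookies = defaultdict(set)
    for ip, cookie, trusted in log:
        if trusted:
            trusted_cookies[ip].add(cookie)

    # Estimated people behind each IP = distinct trusted cookies seen there.
    estimates = {ip: len(cookies) for ip, cookies in trusted_cookies.items()}
    print(estimates)  # {'10.0.0.1': 2}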

D. Sculley - Google "Detecting Adversarial Advertisements in the Wild".  An adversarial advertiser is an advertiser that uses Google AdWords or AdSense to advertise misleading products like counterfeit goods or scams.  Most ads are good; only a small fraction are bad.  Google uses in-house trained people to hand-build rule-based models.  Letting these people hand-build the rules gave them a great incentive and improved morale, rather than just having them do the same repetitive tasks over and over again.  Automated methods are being used as well, but this part of the presentation went right over my head.
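To give a flavor of what a hand-built rule might look like, here is a hypothetical sketch in Python; the keywords, thresholds, and ad fields are invented, and the real system is far richer.

    # Hypothetical hand-written rules; each one flags a suspicious pattern.
    RULES = [
        ("counterfeit_keywords", lambda ad: "replica" in ad["text"].lower()),
        ("too_good_to_be_true", lambda ad: ad["claimed_discount"] > 0.90),
        ("suspicious_landing", lambda ad: ad["landing_domain"].endswith(".biz")),
    ]

    def flag_ad(ad):
        """Return the names of all rules an ad trips."""
        return [name for name, rule in RULES if rule(ad)]

    ad = {"text": "Genuine replica watches!",
          "claimed_discount": 0.95,
          "landing_domain": "deals.example-scam.biz"}
    print(flag_ad(ad))  # all three rules fire on this ad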

Chunyu Luo - University of Tennessee "Enhanced Investment Decisions in P2P Lending: An Investor Composition Perspective".  In this paper, they try to decide which loans are worth investing in; in other words, what makes a good loan?  They use a bipartite investment network with investors on one side, investees on the other, and loans as the edges between them.  Each loan can be considered a composition of many investors.  The idea is that by looking at the past performance of the other investors in a given loan, you can improve your prediction of that loan's return rate.  They ran an experiment on a dataset from prosper.com, and the composition method far outperformed the average return on investment.
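The composition idea boils down to a simple aggregate: score a loan by the historical returns of the investors backing it.  A minimal sketch with made-up numbers, in Python:

    # Hypothetical historical average return per investor.
    investor_return = {"inv_a": 0.12, "inv_b": -0.03, "inv_c": 0.08}

    # Bipartite edges: which investors funded which loan, and for how much.
    loan_investors = {
        "loan_1": [("inv_a", 500), ("inv_c", 300)],
        "loan_2": [("inv_b", 800)],
    }

    def composition_score(loan):
        """Amount-weighted average of the backers' past returns."""
        stakes = loan_investors[loan]
        total = sum(amount for _, amount in stakes)
        return sum(investor_return[inv] * amt for inv, amt in stakes) / total

    for loan in loan_investors:
        print(loan, round(composition_score(loan), 4))
    # loan_1, backed by historically strong investors, scores higher.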

Susan Imberman - College of Staten Island "From Market Baskets to Mole Rats: Using Data Mining Techniques to Analyze RFID Data Describing Laboratory Animal Behavior".  This paper presents the data mining techniques used to analyze RFID data from a colony of Mole Rats.  Much like we use RFID in cars for tolls like EZ Pass, they tag Mole Rats with RFID chips and collect a reading each time one passes specific points in the colony (a series of pipes and rooms).  They used k-means clustering, which revealed animal place preference, and an adjacency matrix to get an idea of which Mole Rats liked to be near one another.  This produced 3 distinct subgraphs which corresponded well to the Mole Rat colony structure: queen, large workers, and small workers.  Finally, they applied market-basket analysis, treating a Mole Rat's repeated behaviors like items in a shopping basket.
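To make the clustering step concrete, here is a small sketch in Python with scikit-learn; the visit counts are invented, but the shape of the data (one row per animal, one column per RFID reader) matches the setup described.

    import numpy as np
    from sklearn.cluster import KMeans

    # Invented data: rows = animals, columns = visit counts per RFID reader.
    visits = np.array([
        [40, 2, 1],    # spends most time near reader 0
        [38, 3, 0],
        [1, 25, 30],   # prefers readers 1 and 2
        [0, 28, 27],
    ])

    # k-means groups animals by place preference.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(visits)
    print(labels)  # e.g. [0 0 1 1]: two groups with distinct preferences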

After the conference ended on Wednesday, Hurricane Irene was on track for a direct hit on Hampton Roads.  My flight was scheduled to arrive in Norfolk around 11pm, which was cutting it very close to the storm hitting.  So I decided to extend the trip until Monday and ride out the storm here in sunny San Diego.  In total, I managed to miss a hurricane, a tornado, an earthquake, and a swamp fire.  I think I made a good decision...
