Tuesday, August 9, 2011

Steal this database? Don't mind if I do.

A while back, Meetup.com issued a pseudo-challenge: steal their database.  Nothing that would result in the FBI knocking on your door, mind you, but a look into their streaming API.  Meetup.com streams all their public events and RSVPs via HTTP streaming or HTML5 websockets, so all that is required to steal their database is a connection to a stream and the ability to save the content.

Their example Python/SQLite script shows how simple it is to capture this data and save it for your own purposes.  As someone who loves data, my initial instinct was to save it myself and start mining it.  To start, the default search on Meetup.com appears not to search specific meetups, just the overall group description.  Searching this data would be an improvement on the existing system.
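
For a sense of what "connect and save" amounts to, here is a minimal sketch in Python. It is not Meetup.com's example script: the function names, the table name, and the stream URL in the comment are my own assumptions, and it assumes the stream delivers one JSON document per line.

```python
import json
import sqlite3
import urllib.request


def save_stream(lines, conn):
    """Store each JSON line of a stream as one row; return the number saved."""
    conn.execute("CREATE TABLE IF NOT EXISTS stream_items (body TEXT NOT NULL)")
    saved = 0
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        json.loads(raw)  # validate only; raises ValueError on malformed JSON
        conn.execute("INSERT INTO stream_items (body) VALUES (?)", (raw,))
        saved += 1
    conn.commit()
    return saved


def capture(url, db_path):
    """Read a line-delimited HTTP stream until it closes and save every item."""
    with urllib.request.urlopen(url) as response:
        return save_stream(response, sqlite3.connect(db_path))


# Hypothetical usage; the URL is an assumption, use the one Meetup.com documents:
# capture("http://stream.meetup.com/2/rsvps", "meetup.db")
```

Keeping the saving logic separate from the network call means it can be tried on any list of lines before pointing it at a live stream.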

One of my favorite software products, elasticsearch, supports a concept called rivers: pluggable services for indexing content from an external source, either via pull or push.  Elasticsearch comes packaged with a plugin for indexing Twitter content via their API.  Adapting this code for Meetup.com seemed almost trivial.  After looking at the differences between the APIs, it became apparent that a more flexible solution was possible.

Enter streaming-river, a simple elasticsearch plugin for indexing any JSON content that is streamed via HTTP or HTML5 Websockets.  In the simplest case, all that is required is the URL of the streaming service.  With elasticsearch's dynamic mappings, the server will automatically map all properties that have not been explicitly mapped.  Of course, not all content is suitable for dynamic mappings, particularly any text content that should not be analyzed, since the default behavior is for all string fields to be analyzed.  Mapping is done using the existing elasticsearch syntax.
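
As an illustration of an explicit mapping, the sketch below builds a mapping body that marks selected string fields as not analyzed, using the `"type": "string"` / `"index": "not_analyzed"` syntax of elasticsearch releases from this era. The helper name and the choice of fields are my own assumptions, not part of streaming-river, and registering the river itself is not shown.

```python
import json


def not_analyzed_mapping(doc_type, field_paths):
    """Build a mapping that indexes the given dotted string fields verbatim."""
    root = {"properties": {}}
    for path in field_paths:
        node = root
        parts = path.split(".")
        # Walk (or create) the nested object mappings for each parent field.
        for part in parts[:-1]:
            node = node["properties"].setdefault(part, {"properties": {}})
        node["properties"][parts[-1]] = {"type": "string", "index": "not_analyzed"}
    return {doc_type: root}


# Field choice is an assumption: identifiers that should be matched whole.
mapping = not_analyzed_mapping("event", ["group.urlname", "venue.country"])
print(json.dumps(mapping, indent=2))
```

Fields left out of the explicit mapping would still be picked up by dynamic mapping and analyzed as usual.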

Meetup.com does not provide a full database dump, so any data mining is limited to the content that has been indexed since the river was added.

On to the data!  Since I am the organizer of the NoSQL NYC meetup, I am always interested in other meetups worldwide.  Meetup.com allows finding related meetups using tags (http://nosql.meetup.com/), but that only finds meetups that use that specific tag. Not good enough.  What about groups that are not solely about NoSQL but have events with NoSQL in the description?

curl -XGET "http://localhost:9200/meetup/event/_search?q=description:nosql"


The results on my test data look like the following (after lots of snipping for clarity):
 {  
  "took": 3,  
  "timed_out": false,  
  "_shards": {  
   "total": 5,  
   "successful": 5,  
   "failed": 0  
  },  
  "hits": {  
   "total": 14,  
   "max_score": 0.9194888,  
   "hits": [  
    {  
     "_index": "meetup",  
     "_type": "event",  
     "_id": "25509441",  
     "_score": 0.9194888,  
     "_source": {  
      "rsvp_limit": 80,  
      "status": "upcoming",  
      "maybe_rsvp_count": 0,  
      "payment_required": "0",  
      "mtime": 1312830490616,  
      "venue": {  
       "zip": "94103",  
       "phone": "415738330",  
       "lon": -122.406745,  
       "name": "Microsoft",  
       "state": "CA",  
       "address_1": "835 Market St Ste 700",  
       "lat": 37.785044,  
       "city": "San Francisco",  
       "country": "us"  
      },  
      "id": "25509441",  
      "utc_offset": -25200000,  
      "time": 1312936200000,  
      "venue_visibility": "public",  
      "yes_rsvp_count": 80,  
      "event_url": "http:\/\/www.sfphp.org\/events\/25509441\/",  
      "description": "<p>We will be doing a \n<a href=\"http:\/\/en.wikipedia.org\/wiki\/NoSQL\">NoSQL<\/a> roundup with \n<a href=\"http:\/\/www.mongodb.org\/\">MongoDB<\/a>, \n<a href=\"http:\/\/cassandra.apache.org\/\">Cassandra<\/a> and \n<a href=\"http:\/\/hadoop.apache.org\/\">Hadoop<\/a>. Doors open at 5:30 and the \n<strong>presentations will start at 6:30pm<\/strong>. There will be food and beverages provided by our sponsors.<\/p>\n<p>The goal of the event is to have three presenters, each given about 45-60 minutes, provide an overview and explain the benefits\/advantages and ideal use-cases of the particular NoSQL solution they are pitching.<\/p>\n<p>\n<strong>Please be aware<\/strong>, there is a new security policy at the Microsoft location that you need to be aware of. We need to provide a sign-in sheet to the security desk prior to the event and&nbsp;\n<strong>\n <em>all&nbsp;attendees MUST bring a photo ID that matches your RSVP name on&nbsp;meetup.com<\/em>\n<\/strong>. I know this is an&nbsp;inconvenience&nbsp;for some of you who use an alias on the site and I&nbsp;apologize&nbsp;for it. Without getting into the pros\/cons or aliases on the interwebs we need to respect the wishes\/policies of our hosts.<\/p>\n<p>1) SriSatish Ambati, who is \"a chief tinkerer of Java and Enterprise stacks for the Cassandra Company, \n<a href=\"http:\/\/www.datastax.com\/\">DataStax<\/a>\" to present on Cassandra<\/p>\n<p>2) William Shulman, who runs \n<a href=\"https:\/\/mongolab.com\">Mongo Lab<\/a>, a Cloud&ndash;Hosted MongoDB, will present on MongoDB<\/p>\n<p>3) Todd Lipcon, from&nbsp; \n<a href=\"http:\/\/www.cloudera.com\/\">Cloudera<\/a>, will present on Hadoop.<\/p>\n<p>&nbsp;<\/p>\n",  
      "name": "NoSQL Roundup",  
      "group": {  
       "id": 120903,  
       "group_lat": 37.79,  
       "name": "The SF PHP Meetup Group",  
       "group_lon": -122.4,  
       "join_mode": "open",  
       "urlname": "sf-php"  
      }  
     }     
    },  
    ...  
    ...  
   ]  
  }  
 }  

Right away, I was able to find a meetup about NoSQL by a non-NoSQL-focused group: http://www.sfphp.org/events/25509441/?eventId=25509441
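
To skim many hits without reading raw JSON, a small helper can pull out just the group, event name, and URL. This is a sketch: the function name is hypothetical, and the sample is a trimmed copy of the hit shown above.

```python
def summarize_hits(response):
    """Return (group name, event name, event URL) for each search hit."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        group = source.get("group", {})
        rows.append((group.get("name"), source.get("name"), source.get("event_url")))
    return rows


# Trimmed copy of the hit shown above.
sample = {
    "hits": {
        "hits": [
            {
                "_id": "25509441",
                "_source": {
                    "name": "NoSQL Roundup",
                    "event_url": "http://www.sfphp.org/events/25509441/",
                    "group": {"name": "The SF PHP Meetup Group", "urlname": "sf-php"},
                },
            }
        ]
    }
}

for group_name, event_name, url in summarize_hits(sample):
    print(f"{group_name}: {event_name} ({url})")
```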

Similar to Solr, elasticsearch provides faceted navigation.  Among other things, facets allow for aggregation queries similar to GROUP BY in SQL.  We can use facets to find which meetup groups have the most meetups about NoSQL:
 curl -X GET "http://localhost:9200/meetup/event/_search?pretty=true" -d '  
 {  
   "query" : {  
     "match_all" : { }  
   },  
   "size" : 0,  
   "facets" : {  
     "topics" : {  
       "terms" : {  
         "field" : "group.urlname",  
         "size" : 10  
       },  
       "facet_filter" : {  
         "term" : { "description" : "nosql"}  
       }  
     }  
   }  
 }  
 '  
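
Reading the answer back is mostly a matter of walking the `facets` section of the response. The sketch below assumes the terms facet response shape of elasticsearch releases from this era (a `terms` list of `term`/`count` entries under the facet name); the function name is hypothetical. One caveat, assuming the default dynamic mapping: a terms facet counts indexed tokens, so an analyzed `group.urlname` such as `sf-php` would likely be split into separate terms unless the field is mapped as not analyzed.

```python
def top_terms(response, facet_name):
    """Return (term, count) pairs from a terms facet, highest count first."""
    facet = response.get("facets", {}).get(facet_name, {})
    pairs = [(entry["term"], entry["count"]) for entry in facet.get("terms", [])]
    # sorted() is stable, so terms with equal counts keep the server's order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Calling `top_terms(response, "topics")` on the parsed JSON from the query above would give the group URL names ranked by number of matching events.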

Moving on to the RSVP data, how many RSVPs are there per hour?
 curl -X GET "http://localhost:9200/meetup/rsvp/_search?pretty=true" -d '  
 {  
   "query" : {  
     "match_all" : { }  
   },  
   "size" : 0,  
   "facets" : {  
     "hours" : {  
       "date_histogram" : {  
         "field" : "mtime",  
         "interval" : "hour"  
       }  
     }  
   }  
 }  
 '  
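
The histogram buckets come back keyed by epoch milliseconds, which are not pleasant to read. A sketch for converting them, assuming the date_histogram facet response shape of this era (an `entries` list of `time`/`count` pairs); the function name is hypothetical:

```python
from datetime import datetime, timezone


def rsvps_per_hour(response, facet_name="hours"):
    """Return (UTC hour, count) pairs from a date_histogram facet."""
    facet = response.get("facets", {}).get(facet_name, {})
    return [
        # Bucket times are epoch milliseconds; convert to an aware UTC datetime.
        (datetime.fromtimestamp(entry["time"] / 1000, tz=timezone.utc), entry["count"])
        for entry in facet.get("entries", [])
    ]
```

Each returned pair is the start of an hour in UTC and the number of RSVPs whose `mtime` fell within it.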

Grab some data and find out the results. These queries are just the tip of the iceberg of what you can do with elasticsearch and the Meetup.com content.  Saving other streams is just as easy; the only slightly difficult part is creating a mapping if the dynamic mappings are not adequate for the searches you require.

