Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More
Why Is Twitter All the Rage?
Most chapters won’t open with a reflective discussion, but since this is the first chapter of the book and introduces a social website that is often misunderstood, it seems appropriate to take a moment to examine Twitter at a fundamental level.
There are many ways to answer this question, but let’s consider it from an overarching angle that addresses some fundamental aspects of our shared humanity that any technology needs to account for in order to be useful and successful. After all, the purpose of technology is to enhance our human experience.
Fundamental Twitter Terminology
Twitter might be described as a real-time, highly social microblogging service that allows users to post short status updates, called tweets, that appear on timelines. Tweets may include one or more entities in their (currently) 280 characters of content and reference one or more places that map to locations in the real world. An understanding of users, tweets, and timelines is particularly essential to effective use of Twitter’s API, so a brief introduction to these fundamental concepts is in order before we interact with the API to fetch some data. We’ve largely discussed Twitter users and Twitter’s asymmetric following model for relationships thus far, so this section briefly introduces tweets and timelines in order to round out a general understanding of the Twitter platform.
Creating a Twitter API Connection
Twitter has taken great care to craft an elegantly simple RESTful API that is intuitive and easy to use. Even so, there are great libraries available to further mitigate the work involved in making API requests. A particularly beautiful Python package that wraps the Twitter API and mimics the public API semantics almost one-to-one is twitter. Like most other Python packages, you can install it with pip by typing pip install twitter in a terminal.
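As a minimal sketch of what creating a connection might look like, the following assumes that you have already obtained the four OAuth values for your application and exported them as environment variables. The variable names and the two helper functions, load_credentials and connect, are hypothetical and exist only for illustration; the OAuth and Twitter classes come from the twitter package.

```python
import os


def load_credentials(env=None):
    """Collect the four OAuth values from a mapping (os.environ by default).

    The key names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    keys = ("CONSUMER_KEY", "CONSUMER_SECRET",
            "OAUTH_TOKEN", "OAUTH_TOKEN_SECRET")
    missing = [key for key in keys if not env.get(key)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in keys}


def connect(creds):
    """Build an API object from a credentials dict."""
    # Imported here so that the rest of the sketch loads even where the
    # twitter package has not been installed yet.
    from twitter import OAuth, Twitter

    auth = OAuth(creds["OAUTH_TOKEN"], creds["OAUTH_TOKEN_SECRET"],
                 creds["CONSUMER_KEY"], creds["CONSUMER_SECRET"])
    return Twitter(auth=auth)
```

Calling connect(load_credentials()) would then return an object whose attribute chain mirrors the URL structure of the REST API. Nothing is sent over the network until a request is actually made on that object.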
Extracting Tweet Entities
Next, let’s distill the entities and the text of some tweets into a convenient data structure for further examination. Example 1-6 extracts the text, screen names, and hashtags from the tweets that are collected and introduces a Python idiom called a double (or nested) list comprehension. If you understand a (single) list comprehension, the code formatting should illustrate the double list comprehension as simply a collection of values that are derived from a nested loop as opposed to the results of a single loop. List comprehensions are particularly powerful because they usually perform better than the equivalent explicit nested loops and provide an intuitive (once you’re familiar with them) yet terse syntax.
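To make the idiom concrete, here is a small sketch that runs the same kind of extraction over two hand-made statuses. The sample data is invented for illustration and keeps only the fields the sketch needs; it assumes the layout in which each status carries a text field and an entities field containing user_mentions and hashtags.

```python
# Two made-up statuses, trimmed down to the fields used below.
statuses = [
    {"text": "Learning #python with @alice and @bob",
     "entities": {
         "user_mentions": [{"screen_name": "alice"}, {"screen_name": "bob"}],
         "hashtags": [{"text": "python"}]}},
    {"text": "#data #python all day",
     "entities": {
         "user_mentions": [],
         "hashtags": [{"text": "data"}, {"text": "python"}]}},
]

# Single list comprehension: one value per status.
status_texts = [status["text"] for status in statuses]

# Double list comprehension: the outer loop walks the statuses and the
# inner loop walks the mentions inside each status.
screen_names = [mention["screen_name"]
                for status in statuses
                    for mention in status["entities"]["user_mentions"]]

hashtags = [hashtag["text"]
            for status in statuses
                for hashtag in status["entities"]["hashtags"]]

# The same pattern splits every text into whitespace-delimited words.
words = [word
         for text in status_texts
             for word in text.split()]
```

The indentation of the inner for clause is purely cosmetic. It is there to show that the comprehension reads in the same order as the nested loop it replaces.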
Computing the Lexical Diversity of Tweets
A slightly more advanced measurement that involves calculating simple frequencies and can be applied to unstructured text is a metric called lexical diversity. Mathematically, this is an expression of the number of unique tokens in the text divided by the total number of tokens in the text, which are both elementary yet important metrics in and of themselves.
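The definition translates almost directly into code. The sketch below assumes that the text has already been split into tokens; the function name and the choice to return 0.0 for an empty list are assumptions made for this illustration.

```python
def lexical_diversity(tokens):
    """Return the number of unique tokens divided by the total number of tokens."""
    if not tokens:
        # Assumption for this sketch: treat empty input as zero diversity
        # rather than dividing by zero.
        return 0.0
    return len(set(tokens)) / len(tokens)


# Six tokens in total, four of them distinct (the, cat, saw, other).
tokens = "the cat saw the other cat".split()
diversity = lexical_diversity(tokens)
```

Here diversity works out to 4/6, or roughly 0.67. A value close to 1.0 means that nearly every token is distinct, while a value close to 0 means that a handful of tokens are repeated over and over.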
Visualizing Frequency Data with Histograms
A nice feature of the Jupyter Notebook is its ability to generate and insert high-quality and customizable plots of data as part of an interactive workflow. In particular, the matplotlib package and other scientific computing tools that are available for the Jupyter Notebook are quite powerful and capable of generating complex figures with very little effort once you understand the basic workflows.
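A basic workflow might look like the following sketch: count how often each item occurs, then hand the counts to matplotlib so that it can group them into bins. The helper names, axis labels, and bin boundaries are choices made for this illustration, and matplotlib is assumed to be installed before plot_histogram is called.

```python
from collections import Counter


def count_values(items):
    """Return how many times each distinct item occurs, largest count first."""
    return sorted(Counter(items).values(), reverse=True)


def plot_histogram(counts, label):
    """Draw a histogram of the counts and return the figure."""
    if not counts:
        raise ValueError("nothing to plot")
    # Imported here so that the counting step works without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # One bin per possible count: edges at 1, 2, ..., max(counts) + 1.
    ax.hist(counts, bins=range(1, max(counts) + 2))
    ax.set_title(label)
    ax.set_xlabel("Number of times an item appeared")
    ax.set_ylabel("Number of items in bin")
    return fig
```

In a notebook, plot_histogram(count_values(words), "Words") would render the figure inline, where words stands for whatever list of tokens you have collected.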
Last Word
Although basic frequency analysis is simple, it is a powerful tool for your repertoire that shouldn’t be overlooked just because it’s so obvious; besides, many more advanced statistics depend on it. Frequency analysis and measures such as lexical diversity should instead be employed early and often, precisely because they are so obvious and simple.