Exploring the LinkedIn API
You’ll need a LinkedIn account and a handful of connections in your professional network to follow along with this chapter’s examples in a meaningful way. If you don’t have a LinkedIn account, you can still apply the fundamental clustering techniques you’ll learn about to other domains, although the chapter will be less engaging without your own LinkedIn data to explore.
Making LinkedIn API Requests
As is the case with other social web properties, such as Twitter and Facebook (discussed in the preceding chapters), the first step in gaining API access to LinkedIn is to create an application. You can create a sample application via the developer portal; take note of your application’s client ID and client secret, as these are the authentication credentials you’ll use to programmatically access the API. Figure 4-1 illustrates the form that you’ll see once you have created an application.
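As a rough sketch of how those credentials come into play, the snippet below builds the request you would POST to LinkedIn’s OAuth 2.0 token endpoint to exchange an authorization code for an access token. The credentials, authorization code, and redirect URI here are placeholders, and the helper function is a hypothetical convenience, not part of any LinkedIn SDK; consult the developer portal documentation for the full authorization flow.

```python
from urllib.parse import urlencode

# Placeholder credentials from your application's page in the developer portal
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"

def build_token_request(auth_code, redirect_uri):
    """Build the URL and POST body for exchanging an OAuth 2.0
    authorization code for an access token (hypothetical helper)."""
    params = {
        "grant_type": "authorization_code",
        "code": auth_code,
        "redirect_uri": redirect_uri,
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
    }
    return "https://www.linkedin.com/oauth/v2/accessToken", urlencode(params)

url, body = build_token_request("abc123", "http://localhost:8000/callback")
print(url)
```

Once you have an access token, you would include it in an Authorization header on subsequent API requests.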
Normalizing Data to Enable Analysis
As a necessary and helpful interlude toward building a working knowledge of clustering algorithms, let’s explore a few of the typical situations you may face in normalizing LinkedIn data. In this section, we’ll implement a common pattern for normalizing company names and job titles. As a more advanced exercise, we’ll also briefly divert and discuss the problem of disambiguating and geocoding geographic references from LinkedIn profile information.
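To make the company-name pattern concrete, here is a minimal sketch of the idea: strip common corporate suffixes and stray punctuation so that variants such as “Example, Inc.” and “Example Inc” collapse to the same canonical name. The suffix list and sample names are illustrative assumptions, not an exhaustive treatment.

```python
# Minimal normalization sketch: collapse common corporate suffixes so
# that superficial variants of the same company name compare as equal.
SUFFIXES = [", Inc.", ", Inc", " Inc.", " Inc", " LLC", " Ltd.", " Ltd"]

def normalize_company(name):
    """Strip a trailing corporate suffix and surrounding punctuation."""
    name = name.strip()
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name.strip().rstrip(",")

print(normalize_company("Example, Inc."))  # -> Example
print(normalize_company("Example Inc"))    # -> Example
```

In practice you would extend the suffix list (and perhaps lowercase the names) based on the variations you actually observe in your own connection data.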
Normalizing and Counting Job Titles
As you might expect, the same problem that occurs with normalizing company names presents itself with job titles, except that it can get a lot messier because job titles are so much more variable.
Although “Engineer” is not a constituent token of the most common job title, it does appear in a large number of job titles (such as “Senior Software Engineer” and “Software Engineer”) that show up near the top of the job titles list. Therefore, the ego of this network appears to have connections to technical practitioners as well.
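The distinction between counting whole titles and counting their constituent tokens can be sketched as follows; the sample titles are made-up stand-ins for real connection data.

```python
# Count full job titles versus the individual tokens within them,
# illustrating how a token like "Engineer" can rank highly even when
# no single engineering title is the most common overall.
from collections import Counter

titles = [
    "Senior Software Engineer",
    "Software Engineer",
    "Chief Executive Officer",
    "Software Engineer",
    "Data Scientist",
]

title_counts = Counter(titles)
token_counts = Counter(token for t in titles for token in t.split())

print(title_counts.most_common(1))  # -> [('Software Engineer', 2)]
print(token_counts["Engineer"])     # -> 3
```

Tokenizing before counting surfaces the aggregate signal ("Engineer") that would otherwise be split across several distinct title strings.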
This chapter was without a doubt more advanced than the preceding chapters in terms of core content, in that it began to address common problems such as normalization of (somewhat) messy data, similarity computation on normalized data, and the computational efficiency of approaches to a common data mining technique. Although it might be difficult to process all of the material in a single reading, don’t be discouraged if you feel a bit overwhelmed.