What is LinkedIn’s knowledge graph?
LinkedIn’s knowledge graph is a large knowledge base built upon “entities” on LinkedIn, such as members, jobs, titles, skills, companies, geographical locations, schools, etc. These entities and the relationships among them form the ontology of the professional world and are used by LinkedIn to enhance its recommender systems, search, monetization and consumer products, and business and consumer analytics.
If you don’t know how a knowledge graph works, think of it as a system made of nodes and edges, where the nodes are the things that exist on LinkedIn (jobs, skills, companies and so on) and the edges are the relationships among them.
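To make the idea of nodes and edges concrete, here is a minimal Python sketch. The `Graph` class, the relationship labels (`has_skill`, `works_at`, `requires_skill`) and the sample names are all illustrative assumptions, not LinkedIn’s actual data model.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory graph: nodes are entity names, edges are labeled relationships."""

    def __init__(self):
        # Maps a source node to a list of (relationship label, target node) pairs.
        self.edges = defaultdict(list)

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbors(self, source, label=None):
        """Return the nodes connected to `source`, optionally filtered by label."""
        return [t for (l, t) in self.edges[source] if label is None or l == label]


# Hypothetical sample data.
graph = Graph()
graph.add_edge("Alice", "has_skill", "Python")
graph.add_edge("Alice", "works_at", "Acme Corp")
graph.add_edge("Data Engineer", "requires_skill", "Python")

print(graph.neighbors("Alice"))               # ['Python', 'Acme Corp']
print(graph.neighbors("Alice", "has_skill"))  # ['Python']
```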
What does LinkedIn’s knowledge graph look like?
Building a graph from entities and the relationships among them makes it possible to discover new relationships. Think of the case in which you have one skill, yet the job you’re looking for requires other skills as well. Based on your job preferences, LinkedIn can infer the skills you need to get that job and suggest that you learn them.
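The skill-gap idea can be sketched as a simple set difference. This is only an illustration under assumed data: the `job_requirements` mapping and the function name `missing_skills` are hypothetical, and the source does not say how LinkedIn actually computes its suggestions.

```python
def missing_skills(member_skills, job_requirements, target_job):
    """Return the skills the target job requires that the member does not list."""
    required = job_requirements.get(target_job, set())
    return sorted(required - set(member_skills))


# Hypothetical job-to-skill edges taken from a toy graph.
job_requirements = {
    "Data Engineer": {"Python", "SQL", "Spark"},
    "Product Manager": {"Roadmapping", "SQL"},
}
member_skills = {"Python"}

print(missing_skills(member_skills, job_requirements, "Data Engineer"))  # ['SQL', 'Spark']
```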
Why is the knowledge graph critical to making LinkedIn smarter?
LinkedIn’s knowledge graph is arguably the most important asset the company has.
Creating a large knowledge base is a big challenge. Websites like Wikipedia and Freebase primarily rely on direct contributions from human volunteers. Other related work, such as Google’s Knowledge Vault and Microsoft’s Satori, focuses on automatically extracting facts from the internet for constructing knowledge bases. Different from these efforts, we derive LinkedIn’s knowledge graph primarily from a large amount of user-generated content from members, recruiters, advertisers, and company administrators, and supplement it with data extracted from the internet, which is noisy and can have duplicates. The knowledge graph needs to scale as new members register, new jobs are posted, new companies, skills, and titles appear in member profiles and job descriptions, etc.
Entities are created in two ways:
- Organic entities are generated by users (Explicit): when you fill in your profile, you’re building data for LinkedIn’s knowledge graph, for instance when you add a company to your profile. At the same time, that company has a page administered by another user, with relevant information about that company. LinkedIn can leverage that data to build the smart infrastructure that relies on the graph. The LinkedIn engineering team calls those member-generated entity relationships “explicit.”
- Auto-created entities are generated by LinkedIn (Inferred): imagine that you entered a company’s name in your profile and misspelled it. LinkedIn needs a mechanism to fix that mistake so that misplaced data does not end up in its knowledge graph. What does it do? LinkedIn’s algorithms infer what you meant and fix the mistake. This matters because the organic content generated by users (the so-called explicit data) can contain many errors.
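As a rough illustration of how a misspelled company name could be mapped back to a known entity, the sketch below uses fuzzy string matching from Python’s standard `difflib` module. The list of known companies, the `resolve_company` helper and the 0.8 cutoff are assumptions made for this example; the source does not describe the algorithm LinkedIn uses.

```python
import difflib

# Hypothetical list of canonical company entities.
KNOWN_COMPANIES = ["LinkedIn", "Microsoft", "Google"]


def resolve_company(raw_name, known=KNOWN_COMPANIES, cutoff=0.8):
    """Return the closest canonical company name, or None if nothing is close enough."""
    # Compare in lowercase, then map the match back to its canonical spelling.
    lookup = {name.lower(): name for name in known}
    matches = difflib.get_close_matches(
        raw_name.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


print(resolve_company("Linkdin"))  # LinkedIn
print(resolve_company("Acme"))     # None
```

Returning `None` when no candidate is similar enough keeps the sketch from forcing a wrong match onto a name that is genuinely new.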
How does LinkedIn auto-create entities?
LinkedIn looks for entity candidates among the data it finds on members’ profiles. This includes information related to tens of thousands of skills, titles, locations, companies, certificates and so on. Those are the entities in LinkedIn’s knowledge graph, and they represent the nodes.
The process follows four steps:
- Generate candidates: think of them as simple English phrases (such as “Gennaro created FourWeekMBA.com”)
- Disambiguate entities: a phrase can have different meanings depending on the context. This step identifies the meaning of a phrase according to the context in which it sits
- De-duplicate entities: multiple phrases that might represent the same entity are represented as word vectors and clustered
- Translate entities into other languages: the top-level entities are translated to allow high precision
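The de-duplication step can be sketched in a few lines of Python. Note the simplifications: plain bag-of-words counts stand in for learned word vectors, and a greedy threshold-based grouping stands in for a real clustering algorithm. The function names, the sample phrases and the 0.9 threshold are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def to_vector(phrase):
    """Turn a phrase into a bag-of-words count vector (a crude stand-in for a word vector)."""
    return Counter(re.findall(r"[a-z0-9]+", phrase.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def deduplicate(phrases, threshold=0.9):
    """Greedily group phrases whose vectors are close to a cluster's first phrase."""
    clusters = []  # each cluster is a list; its first phrase acts as representative
    for phrase in phrases:
        vector = to_vector(phrase)
        for cluster in clusters:
            if cosine(vector, to_vector(cluster[0])) >= threshold:
                cluster.append(phrase)
                break
        else:
            clusters.append([phrase])
    return clusters


candidates = [
    "Software Engineer",
    "software engineer",
    "Engineer, Software",
    "Senior Software Engineer",
    "Registered Nurse",
]
clusters = deduplicate(candidates)
for cluster in clusters:
    print(cluster)
```

With these settings the three spellings of “Software Engineer” end up in one cluster, while “Senior Software Engineer” and “Registered Nurse” each stay on their own.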
The knowledge graph is built on top of taxonomies. A taxonomy has a hierarchical structure: a cluster of terms, such as “Software Engineer,” “Developer” or “Programmer,” is grouped under “Software Developer,” which in turn sits under “Engineering.”
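One simple way to picture such a hierarchy is a child-to-parent mapping. The sketch below uses the terms from the paragraph above; the `PARENT` dictionary and the `path_to_root` helper are illustrative names, and the code assumes the taxonomy has no cycles.

```python
# Child term -> parent term, using the example terms from the text.
PARENT = {
    "Software Engineer": "Software Developer",
    "Developer": "Software Developer",
    "Programmer": "Software Developer",
    "Software Developer": "Engineering",
}


def path_to_root(term, parent=PARENT):
    """Walk up the taxonomy from a term to its top-level category."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


print(path_to_root("Programmer"))  # ['Programmer', 'Software Developer', 'Engineering']
```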
Those taxonomies are organized in two ways:
- Relationships to other entities in a taxonomy: think of all the connections between several entities. For instance, how companies, members, skills, and industries are connected
- Characteristic features not in any taxonomy: think of all the metadata (data about data). For instance, the company logo, revenue, URL and so on
This allows LinkedIn to build a knowledge graph where the relationships are the edges, while the entities or things in the graph are the nodes.
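The split between relationships (which become edges) and characteristic features (which stay on the node as metadata) can be sketched with a small data class. The `Entity` class, its field names and the sample company are hypothetical and only meant to illustrate the distinction.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node in the graph, with taxonomy relationships and free-form metadata."""
    name: str
    entity_type: str
    # Relationship label -> names of related entities (these become edges).
    relationships: dict = field(default_factory=dict)
    # Characteristic features that are not part of any taxonomy (metadata).
    attributes: dict = field(default_factory=dict)


def to_edges(entity):
    """Flatten an entity's relationships into (source, label, target) triples."""
    return [
        (entity.name, label, target)
        for label, targets in entity.relationships.items()
        for target in targets
    ]


# Hypothetical company entity.
company = Entity(
    name="Acme Corp",
    entity_type="company",
    relationships={"industry": ["Software"], "located_in": ["Dublin"]},
    attributes={"url": "https://example.com", "logo": "acme.png"},
)

print(to_edges(company))
# [('Acme Corp', 'industry', 'Software'), ('Acme Corp', 'located_in', 'Dublin')]
```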
Where does LinkedIn get the data to create those relationships and entities?
The LinkedIn engineering team refers to it as the “LinkedIn ecosystem.” It is made of a few main parts:
- the mappings from members to other entities (like the skills of each member) are critical for things like ad targeting, people search, recruiter search, feed, and business and consumer analytics;
- the mappings from jobs to other entities (like the skills required for jobs) are instead critical to driving job recommendations and job search.
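To show how the two mappings could work together, the sketch below ranks jobs by how much their required skills overlap with a member’s skills, using Jaccard similarity. This is an assumed, deliberately simple scoring rule with made-up data; the source does not describe LinkedIn’s actual recommendation models.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_jobs(member_skills, job_skills):
    """Return job titles with a non-zero skill overlap, best match first."""
    scored = [(jaccard(member_skills, skills), job) for job, skills in job_skills.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [job for score, job in ranked if score > 0]


# Hypothetical member-to-skill and job-to-skill mappings.
member_skills = {"Python", "SQL"}
job_skills = {
    "Data Engineer": {"Python", "SQL", "Spark"},
    "Product Manager": {"Roadmapping", "SQL"},
    "Nurse": {"Patient Care"},
}

print(recommend_jobs(member_skills, job_skills))  # ['Data Engineer', 'Product Manager']
```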
As we’ve seen, one crucial aspect of LinkedIn’s knowledge graph is the production of data. We’ve seen two kinds of data: explicit and inferred. How does it work in practice?
Explicit vs. Inferred LinkedIn generated data and relationships
In its article, the LinkedIn engineering team compares user-generated data, which LinkedIn calls explicit, with the data automatically generated by LinkedIn’s algorithms, which it calls inferred. The article includes an interesting case study as an example.
On the left, the case study shows the skills entered by the member (like “Distributed Systems,” “Hadoop,” “Scalability” and so on). On the right, it shows the inferred skills, each with a certain degree of confidence (“Product Management,” “Management,” “Consulting,” “Networking” and so on).
In other words, besides the skills that you list in your profile, LinkedIn’s algorithms compute a set of inferred skills to balance the user-generated content.
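The source does not explain how the inferred skills and their confidence scores are computed. Purely as an assumption, the sketch below uses skill co-occurrence: among other profiles that share at least one explicit skill with the member, it measures how often each additional skill appears and treats that share as the confidence. All profiles and names are invented.

```python
from collections import Counter


def infer_skills(explicit, other_profiles, min_confidence=0.5):
    """Suggest skills that often co-occur with the member's explicit skills."""
    # Keep only profiles that share at least one explicit skill with the member.
    similar = [p for p in other_profiles if p & explicit]
    if not similar:
        return {}
    counts = Counter(skill for p in similar for skill in p - explicit)
    # Confidence = share of similar profiles that list the candidate skill.
    return {
        skill: count / len(similar)
        for skill, count in counts.items()
        if count / len(similar) >= min_confidence
    }


# Hypothetical profiles of other members.
other_profiles = [
    {"Hadoop", "Distributed Systems", "Scalability"},
    {"Hadoop", "Distributed Systems", "Management"},
    {"Hadoop", "Management", "Consulting"},
    {"Nursing"},
]
explicit = {"Hadoop", "Distributed Systems"}

inferred = infer_skills(explicit, other_profiles)
print(inferred)  # "Management" with a confidence of about 0.67
```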
The LinkedIn Knowledge Graph in action
As we’ve seen so far, what makes the LinkedIn platform tick is the information and data organized in a large knowledge graph, made of nodes and edges. This vast database is technically called a GraphDB (graph database). It is extremely effective because it allows platforms like LinkedIn to scale up:
Graph databases shine when you are trying to relate entities (nodes) to each other along relationships (edges). On top of new functionalities, the GraphDB at LinkedIn is heavily optimized, and is able to support millions of queries per second at very low latencies.
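To illustrate the kind of question a graph database answers well, here is a toy two-hop query in Python: “which skills do the employees of a company have?” It is a sketch over an in-memory list of triples, not LinkedIn’s GraphDB or its query language; the edge labels and names are assumptions, and a real graph database would use indexes rather than the linear scans shown here.

```python
# Hypothetical (source, label, target) triples.
EDGES = [
    ("Alice", "works_at", "Acme Corp"),
    ("Bob", "works_at", "Acme Corp"),
    ("Alice", "has_skill", "Python"),
    ("Bob", "has_skill", "SQL"),
    ("Carol", "works_at", "Globex"),
    ("Carol", "has_skill", "Go"),
]


def sources(edges, label, target):
    """All nodes that point at `target` through an edge with the given label."""
    return {s for s, l, t in edges if l == label and t == target}


def targets(edges, source, label):
    """All nodes that `source` points at through an edge with the given label."""
    return {t for s, l, t in edges if s == source and l == label}


def skills_at_company(edges, company):
    """Two-hop traversal: company <- works_at - member - has_skill -> skill."""
    skills = set()
    for employee in sources(edges, "works_at", company):
        skills |= targets(edges, employee, "has_skill")
    return sorted(skills)


print(skills_at_company(EDGES, "Acme Corp"))  # ['Python', 'SQL']
```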
LinkedIn’s algorithms have become much smarter. However, AI and machine-learning models need a lot of data to improve, so handling massive amounts of data efficiently becomes critical. The knowledge graph is the infrastructure that allows doing just that: it handles complex, structured and vast amounts of data quickly and efficiently. That is why understanding how a knowledge graph works is also critical to understanding how those platforms are evolving.
Other companies that manage a massive amount of metadata (data about data), like Facebook and Google, use the same kind of infrastructure.