Information networks have long been a popular topic in data mining. However, most contemporary information network analyses focus only on homogeneous networks (where there is a single type of object and link), or simply ignore the heterogeneity of objects and links. Heterogeneous information networks, by contrast, can carry more information and richer semantics than homogeneous networks. In this paper, the authors present a comprehensive survey of recent heterogeneous information network (HIN) analysis and point out some future directions.
In this part, the authors introduce some basic concepts in HIN. I simply note down the terminology here without giving the concrete definitions.
Information network, heterogeneous/homogeneous information network, network schema (specifies type constraints on the sets of objects and on the relationships among the objects), meta path.
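To make the last two terms more concrete, here is a minimal sketch, not taken from the paper. It assumes a hypothetical bibliographic schema in which each relation name maps to a (source type, target type) pair, and it treats a meta path as a sequence of relations that chain by type. The names `SCHEMA`, `is_valid_meta_path` and `meta_path_types` are invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical network schema: relation name -> (source object type, target object type).
SCHEMA: Dict[str, Tuple[str, str]] = {
    "writes": ("Author", "Paper"),
    "written_by": ("Paper", "Author"),
    "published_in": ("Paper", "Venue"),
    "publishes": ("Venue", "Paper"),
}


def is_valid_meta_path(relations: List[str], schema: Dict[str, Tuple[str, str]] = SCHEMA) -> bool:
    """Return True if every relation exists in the schema and consecutive relations chain by type."""
    if not relations:
        return False
    if any(rel not in schema for rel in relations):
        return False
    # The target type of each relation must equal the source type of the next one.
    return all(schema[prev][1] == schema[nxt][0] for prev, nxt in zip(relations, relations[1:]))


def meta_path_types(relations: List[str], schema: Dict[str, Tuple[str, str]] = SCHEMA) -> List[str]:
    """Return the sequence of object types visited by a valid meta path."""
    if not is_valid_meta_path(relations, schema):
        raise ValueError("relations do not form a valid meta path under this schema")
    types = [schema[relations[0]][0]]
    for rel in relations:
        types.append(schema[rel][1])
    return types


if __name__ == "__main__":
    # Author-Paper-Venue-Paper-Author: authors who publish in the same venue.
    apvpa = ["writes", "published_in", "publishes", "written_by"]
    print(meta_path_types(apvpa))
    print(is_valid_meta_path(["writes", "publishes"]))  # types do not chain
```

Under this toy schema, the Author-Paper-Author path (co-authorship) and the Author-Paper-Venue-Paper-Author path (publishing in the same venue) connect the same object types but express different semantics.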
The paper also compares heterogeneous networks with other related concepts, including: multi-relational network (has only one type of object, but more than one kind of relationship between objects), multi-dimensional/mode network (same as multi-relational network), composite network (same as multi-relational network), and complex network (a network with non-trivial topological features and patterns of connection between its elements that are neither purely regular nor purely random). All of these kinds of networks can be viewed as heterogeneous networks.
Heterogeneous networks can be constructed from three types of data: structured data, semi-structured data and unstructured data. Some widely used heterogeneous networks are: multi-relational networks with single-typed objects (Facebook and Xiaonei), bipartite networks (user-item, document-word, k-partite graphs), star-schema networks (IMDB, patent data, bibliographic information networks), multiple-hub networks (bioinformatics networks, the Douban dataset), etc.
Some real data are too complex to be modeled as meaningful HINs, and it is sometimes difficult to analyze networked data from an HIN perspective.
In this section, the authors cover several areas of research development, each with a number of related studies. Here I do not go through the individual studies; I only give a basic definition and the research direction for each area.
Similarity measure evaluates the similarity of objects. Approaches can be categorized into feature based approaches and link based approaches. Recently, many researchers have studied similarity measures on HINs, and more work has begun to integrate the network structure with other information to measure the similarity of objects.
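As an illustration of a link based similarity on an HIN, the sketch below computes a PathSim-style score along the symmetric meta path Author-Paper-Author. PathSim is not named in this note; it is used here as one known measure from the HIN literature. The toy data and the function names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical toy data: author -> list of papers they wrote.
AUTHOR_PAPERS: Dict[str, List[str]] = {
    "alice": ["p1", "p2"],
    "bob": ["p2", "p3"],
    "carol": ["p4"],
}


def path_count(a: str, b: str, author_papers: Dict[str, List[str]] = AUTHOR_PAPERS) -> int:
    """Count Author-Paper-Author path instances between authors a and b."""
    papers_a = Counter(author_papers.get(a, []))
    papers_b = Counter(author_papers.get(b, []))
    # Each shared paper contributes one path instance a -> paper -> b.
    return sum(papers_a[p] * papers_b[p] for p in papers_a)


def pathsim(a: str, b: str, author_papers: Dict[str, List[str]] = AUTHOR_PAPERS) -> float:
    """PathSim-style score: 2 * paths(a, b) / (paths(a, a) + paths(b, b))."""
    denom = path_count(a, a, author_papers) + path_count(b, b, author_papers)
    if denom == 0:
        return 0.0
    return 2 * path_count(a, b, author_papers) / denom


if __name__ == "__main__":
    print(pathsim("alice", "bob"))    # one shared paper out of two each
    print(pathsim("alice", "carol"))  # no shared papers
```

The normalization by the self-path counts keeps a highly connected object from looking similar to everything; choosing a different meta path changes the notion of similarity being measured.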
Clustering is the process of partitioning a set of data objects into a set of clusters. Compared with homogeneous networks, HINs integrate multi-typed objects, which brings new challenges to clustering. Usually, attribute information, text information and user guidance information are integrated into the clustering analysis. Besides, clustering can be combined with other mining tasks so that the tasks mutually enhance each other and improve performance, as in RankClus and NetClus.
Classification is a data analysis task where a model or classifier is constructed to predict class (categorical) labels. In an HIN, we have multiple types of objects, so we can perform multi-label classification; besides, label knowledge can spread through the various links among different-typed objects. The meta path, a unique characteristic of HINs, is also widely used in classification on HINs, as in GNetMine and HetPathMine.
Link prediction attempts to estimate the likelihood of the existence of a link between two nodes, based on the observed links and the attributes of the nodes. In an HIN, the links to be predicted are of different types, and there are dependencies among the multiple types of links. Many works utilize meta paths to help link prediction, such as PathPredict. In addition, probabilistic models are also widely applied to link prediction tasks in HINs.
Most of the available work on link prediction is designed for static networks, but dynamic link prediction is also very important and remains to be solved.
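The idea of using meta paths for link prediction can be sketched as follows: count path instances along several meta paths between a candidate pair, then combine those counts into a score. This is only an illustration under stated assumptions, not PathPredict itself. The toy data are invented, and the weights and bias are hypothetical hand-set values that would normally be learned from observed links.

```python
import math
from typing import Dict, Sequence, Set

# Hypothetical toy data: author -> set of papers, paper -> venue.
AUTHOR_PAPERS: Dict[str, Set[str]] = {
    "alice": {"p1", "p2"},
    "bob": {"p2", "p3"},
    "carol": {"p4"},
}
PAPER_VENUE: Dict[str, str] = {"p1": "KDD", "p2": "KDD", "p3": "VLDB", "p4": "VLDB"}


def apa_count(a: str, b: str) -> int:
    """Path instances along Author-Paper-Author (shared papers)."""
    return len(AUTHOR_PAPERS[a] & AUTHOR_PAPERS[b])


def apvpa_count(a: str, b: str) -> int:
    """Path instances along Author-Paper-Venue-Paper-Author (paper pairs in the same venue)."""
    return sum(
        1
        for pa in AUTHOR_PAPERS[a]
        for pb in AUTHOR_PAPERS[b]
        if PAPER_VENUE[pa] == PAPER_VENUE[pb]
    )


def link_score(a: str, b: str, weights: Sequence[float] = (1.0, 0.5), bias: float = -2.0) -> float:
    """Logistic score over meta path count features; weights and bias are made-up values."""
    features = (apa_count(a, b), apvpa_count(a, b))
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    for pair in [("alice", "bob"), ("bob", "carol"), ("alice", "carol")]:
        print(pair, round(link_score(*pair), 3))
```

Each meta path contributes a feature with its own meaning, which is how the different link types and their dependencies can enter a single model.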
Ranking evaluates object importance or popularity based on some ranking function. Ranking in HINs faces many challenges due to HIN’s heterogeneity. Some works propose path based ranking methods that take the characteristics of meta paths in HINs into account. The ranking problem has also been extended to HINs constructed from social media networks.
A recommendation system helps consumers find products that they might be interested in. Traditional research uses collaborative filtering and matrix factorization on homogeneous networks. Heterogeneous information is important for recommendation, and meta paths are widely used to explore the semantics and extract relations among objects.
Information fusion is the process of merging information from heterogeneous sources with differing conceptual, contextual and typographical representations. To fuse information from different HINs, an important prerequisite is to align the HINs via shared common entities.
By transferring the available heterogeneous information to other aligned networks, we can help solve data sparsity and improve the results of link prediction, community detection, recommendation, etc.
Other topics include influence propagation and privacy risk problems in HINs.
Heterogeneous information network analysis is still a young and promising research field. In this section, the authors illustrate some advanced topics and point out some future directions of HIN.
- More complex network construction
- More powerful mining methods
- Bigger networked data
- More applications: online analytical processing, information diffusion, etc.
Most networks in real life are heterogeneous, so the analysis of heterogeneous information networks is important. Combined with network representation, it would be useful to find a generic method for representing HINs. Besides, we also need to know how to deal with big networked data by making use of distributed platforms.