People and informational objects are interconnected, forming gigantic, integrated information networks. When these data objects are structured into multiple types, such networks become semi-structured heterogeneous information networks. Most real-world applications that handle big data, including social media and social networks, medical information systems, online e-commerce systems, and database systems, can be structured into typed, semi-structured, heterogeneous information networks. For example, in a medical care network, objects of multiple types, such as patients, doctors, diseases, and medications, and links such as visits, diagnoses, and treatments are intertwined, providing rich information and forming a heterogeneous information network. Effective construction, exploration, and analysis of large-scale heterogeneous information networks pose an interesting and critical challenge.
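As a rough illustration of the medical care example, a typed network can be sketched as a set of nodes that each carry a type and a set of links that each carry a relation name. The class `HeteroNetwork`, the helper `build_example`, and all node and relation names below are hypothetical and chosen only for illustration; this is a minimal sketch, not a representation prescribed by the talk.

```python
from collections import defaultdict


class HeteroNetwork:
    """Toy typed network: every node has a type, every link has a relation name."""

    def __init__(self):
        self.node_type = {}             # node id -> type name (hypothetical schema)
        self.links = defaultdict(set)   # (source id, relation) -> set of target ids

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_link(self, src, relation, dst):
        # Require typed endpoints so that every link connects known object types.
        if src not in self.node_type or dst not in self.node_type:
            raise KeyError("both endpoints must be added as typed nodes first")
        self.links[(src, relation)].add(dst)

    def neighbors(self, node, relation):
        """Targets reachable from `node` over one link of the given relation."""
        return sorted(self.links.get((node, relation), set()))

    def nodes_of_type(self, ntype):
        return sorted(n for n, t in self.node_type.items() if t == ntype)


def build_example():
    """Build a tiny medical care network with four object types."""
    net = HeteroNetwork()
    net.add_node("alice", "patient")
    net.add_node("bob", "patient")
    net.add_node("dr_lee", "doctor")
    net.add_node("flu", "disease")
    net.add_node("oseltamivir", "medication")

    net.add_link("alice", "visits", "dr_lee")
    net.add_link("bob", "visits", "dr_lee")
    net.add_link("alice", "diagnosed_with", "flu")
    net.add_link("flu", "treated_by", "oseltamivir")
    return net


if __name__ == "__main__":
    net = build_example()
    print(net.nodes_of_type("patient"))
    print(net.neighbors("alice", "diagnosed_with"))
```

Keeping the type on the node and the relation on the link is what separates this model from a homogeneous graph: a query can ask for patients specifically, or follow only diagnosis links, rather than treating all nodes and edges alike.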
In this talk, we first present a set of data mining scenarios in heterogeneous social and information networks and show that mining typed, heterogeneous networks is a new and promising frontier in data mining research. Departing from many existing network models that view data as homogeneous graphs or networks, the semi-structured heterogeneous information network model leverages the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data. This heterogeneous network modeling will lead to the discovery of a set of new principles and methodologies for mining and exploring interconnected data, such as rank-based clustering and classification, meta path-based similarity search, and meta path-based link/relationship prediction. We then discuss our recent progress on the construction of quality semi-structured heterogeneous information networks from unstructured data. Finally, we point out some promising research directions in this domain.
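To make meta path-based similarity search concrete, the sketch below scores pairs of patients along a symmetric patient-disease-patient meta path. It assumes a PathSim-style measure, in which the number of path instances between two objects is normalized by the number of path instances from each object back to itself; the abstract does not name a specific measure, so this choice is an assumption. The data and the names `DIAGNOSED`, `path_count`, `pathsim`, and `most_similar` are hypothetical.

```python
# Hypothetical patient -> {disease: link weight} adjacency for the
# patient-disease-patient meta path.
DIAGNOSED = {
    "alice": {"flu": 1, "asthma": 1},
    "bob": {"flu": 1},
    "carol": {"diabetes": 1},
}


def path_count(adj, x, y):
    """Count patient-disease-patient path instances between x and y."""
    return sum(w * adj[y].get(disease, 0) for disease, w in adj[x].items())


def pathsim(adj, x, y):
    """PathSim-style score: 2 * paths(x, y) / (paths(x, x) + paths(y, y))."""
    denom = path_count(adj, x, x) + path_count(adj, y, y)
    return 2 * path_count(adj, x, y) / denom if denom else 0.0


def most_similar(adj, query, k=1):
    """Return the k objects most similar to `query`, ties broken by name."""
    scored = [(pathsim(adj, query, other), other) for other in adj if other != query]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:k]]


if __name__ == "__main__":
    for name, score in most_similar(DIAGNOSED, "alice", k=2):
        print(f"{name}: {score:.3f}")
```

In this toy data, alice and bob share one diagnosis, so bob ranks above carol, who shares none. Choosing a different meta path, for example patient-doctor-patient, would define a different notion of similarity over the same network, which is the sense in which the typed links carry semantics.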