Paths of A Million People: Extracting Life Trajectories from Wikipedia

ICWSM 2025

ShanghaiTech University
* Equal Contribution

Abstract

The life trajectories of notable people have been studied to pinpoint the times and places of significant events such as birth, death, education, marriage, competition, work, speeches, scientific discoveries, artistic achievements, and battles. Understanding how these individuals interact with others provides valuable insights for broader research into human dynamics.

However, the scarcity of trajectory data in terms of volume, density, and inter-person interactions limits how comprehensive and interactive relevant studies can be. We mine millions of biography pages from Wikipedia and tackle the generalization problem stemming from the variety and heterogeneity of the trajectory descriptions.

Our ensemble model COSMOS, which combines semi-supervised learning and contrastive learning, achieves an F1 score of 85.95%. We also perform an empirical analysis of the trajectories of 8,272 historians to demonstrate the validity of the extracted results.
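For readers unfamiliar with the metric, the sketch below shows how an F1 score is computed for a binary decision such as "is this candidate triplet a valid trajectory point?". It is a generic illustration, not the paper's evaluation script: the function name `precision_recall_f1`, the label encoding (1 = valid, 0 = invalid), and the toy labels are all assumptions, and the averaging scheme behind the reported 85.95% is not stated here.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for binary labels, where 1 marks a valid triplet."""
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only.
gold = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```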

To facilitate research on trajectory extraction and help analytical studies construct grand narratives, we make our code, the million-level extracted trajectories, and the hand-curated dataset WikiLifeTrajectory publicly available.

Architecture of COSMOS

Motivated by the need to better utilize the intrinsic structure of data, COSMOS combines contrastive learning and semi-supervised learning to obtain robust representations. As illustrated in the figure, COSMOS learns the representations of triplets and their contexts through a CNN and BERT running in parallel, and then classifies the triplets based on the resulting representations.

Visualization of Life Trajectory

Life trajectories of H. Bruce Franklin, Karl Theodor Keim, and John Henry Brown. The arrows of each color represent the life trajectory of the corresponding individual. The start point of each trajectory is marked with a circle. The year and purpose of each move are labeled on the arrows.

Co-occurrence Network

Dynamic interaction network comprising 899 historians. (a) Snapshots of the network at intervals from 1910 to 2020. Nodes represent historians, sized by PageRank and colored by nationality. (b) and (c) zoom in on two connected components in the 1980 snapshot and the 2020 snapshot, respectively.
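A minimal sketch of how such a network could be built from extracted triplets is given below. It assumes that two people "co-occur" when they appear in the same location in the same year, and that a snapshot accumulates all co-occurrences up to a given year; both are illustrative assumptions rather than the paper's exact definitions. The names `cooccurrence_edges` and `pagerank` and the toy triplets are hypothetical, and PageRank is implemented by plain power iteration to keep the example dependency-free.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (person, year, location) triplets, invented for illustration.
triplets = [
    ("A", 1980, "Paris"),
    ("B", 1980, "Paris"),
    ("C", 1980, "Paris"),
    ("C", 1985, "Berlin"),
    ("D", 1985, "Berlin"),
    ("E", 1990, "Rome"),
]


def cooccurrence_edges(triplets, up_to_year):
    """Link two people if they share a (year, location) on or before up_to_year."""
    present = defaultdict(set)
    for person, year, location in triplets:
        if year <= up_to_year:
            present[(year, location)].add(person)
    edges = set()
    for people in present.values():
        for a, b in combinations(sorted(people), 2):
            edges.add((a, b))
    return edges


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected graph.

    Nodes without neighbors spread their rank uniformly over all nodes.
    """
    neighbors = {node: set() for node in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not neighbors[v])
        new_rank = {}
        for v in nodes:
            incoming = sum(rank[u] / len(neighbors[u]) for u in neighbors[v])
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


nodes = sorted({person for person, _, _ in triplets})
snapshot_1980 = cooccurrence_edges(triplets, 1980)
snapshot_2020 = cooccurrence_edges(triplets, 2020)
scores = pagerank(nodes, snapshot_2020)
```

In this toy data, "C" links the Paris group to "D" in Berlin, so it ends up with the most neighbors in the 2020 snapshot; in a figure like the one above, it would be drawn as the largest node.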

Million-level Life Trajectories 📊

From controlled experiments to large-scale practice, we scale up life trajectory extraction from 1.9 million Wikipedia biographies to over 5 million life trajectory triplets (Person, Time, Location). The resulting dataset is publicly available for exploration!
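To make the triplet format concrete, the sketch below groups (Person, Time, Location) triplets by person, orders them by time, and derives the moves between consecutive locations. The tuple layout, the use of a plain integer year for Time, the helper names `build_trajectories` and `moves`, and the sample records are all assumptions made for illustration; the released dataset's actual schema may differ.

```python
from collections import defaultdict

# Hypothetical triplets in (person, year, location) form; not taken from the released dataset.
triplets = [
    ("Person X", 1975, "Chicago"),
    ("Person X", 1950, "Boston"),
    ("Person Y", 1960, "Vienna"),
    ("Person X", 1962, "Boston"),
    ("Person X", 1990, "London"),
]


def build_trajectories(triplets):
    """Group triplets by person and sort each person's visits by year."""
    by_person = defaultdict(list)
    for person, year, location in triplets:
        by_person[person].append((year, location))
    return {person: sorted(visits) for person, visits in by_person.items()}


def moves(trajectory):
    """Return (year, origin, destination) for each change of location."""
    result = []
    for (_, origin), (year, destination) in zip(trajectory, trajectory[1:]):
        if origin != destination:
            result.append((year, origin, destination))
    return result


trajectories = build_trajectories(triplets)
person_x_moves = moves(trajectories["Person X"])
```

Consecutive records at the same location (the two Boston entries here) do not produce a move, so only genuine relocations would be drawn as arrows on a map.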

  View on Hugging Face

BibTeX

@inproceedings{zhang2025paths,
    title={Paths of A Million People: Extracting Life Trajectories from Wikipedia},
    author={Zhang, Ying and Li, Xiaofeng and Liu, Zhaoyang and Zhang, Haipeng},
    booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
    volume={19},
    pages={2226--2240},
    year={2025}
}