On August 4th, the data science community Kaggle announced a free, open pipeline to the machine-readable dataset of the open-access repository arXiv.

“Having the entire arXiv corpus on Kaggle grows the potential of arXiv articles immensely,” said Eleonora Presani, arXiv Executive Director, in Kaggle’s Medium article. “By offering the dataset on Kaggle we go beyond what humans can learn by reading all these articles and we make the data and information behind arXiv available to the public in a machine-readable format.”

Kaggle said its hope was to “empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.”

The dataset is now available on Kaggle and will be updated weekly.

Read the full article on Medium or on the arXiv blog.

