arXiv makes its data available through S3 buckets; see http://arxiv.org/help/bulk_data_s3

Some highlights from that page:

- Papers are available as both PDF and LaTeX source
- Complete PDF data is about 270GB
- Complete source data is about 190GB
- All data lives in requester-pays buckets, so we'll have to cover the cost of the pull ($0.12/GB, about $55 total for both)

Questions:

- Where are we going to host the local copies? Anyone want to volunteer server space?
- How will analysis work?
  - Option 1: grep PDF text for URLs and/or known DOIs for software
  - Option 2: parse the source directly, maybe with plasTeX? This could get expensive and difficult, but may give more reliable results
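The transfer cost mentioned in the highlights can be checked with quick arithmetic. The sketch below only multiplies the sizes and the per-GB rate quoted above; the constant and function names are made up for illustration, and actual S3 pricing may differ from the quoted rate.

```python
# Figures quoted in the highlights above; real S3 pricing may differ.
PDF_GB = 270
SOURCE_GB = 190
RATE_USD_PER_GB = 0.12


def transfer_cost(size_gb, rate=RATE_USD_PER_GB):
    """Return the estimated requester-pays transfer cost in USD."""
    return size_gb * rate


pdf_cost = transfer_cost(PDF_GB)
source_cost = transfer_cost(SOURCE_GB)
total_cost = pdf_cost + source_cost

print(f"PDF: ${pdf_cost:.2f}  source: ${source_cost:.2f}  total: ${total_cost:.2f}")
```

Pulling only the source data would cut the bill to a bit under half of the total, which may matter if Option 2 ends up being the analysis route.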
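To get a feel for Option 1, here is a minimal sketch of pulling URLs and DOI-like strings out of text with regular expressions. It assumes the PDF text has already been extracted to a plain string by some other tool; `find_links`, `match_known_dois`, the sample text, and the "known DOI" set are all hypothetical, and the patterns are deliberately loose, so expect false positives and truncated matches on real papers.

```python
import re

# Loose patterns: good enough for a first pass, not a full URL/DOI grammar.
URL_RE = re.compile(r"https?://[^\s<>)\]}]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s<>]+")


def find_links(text):
    """Return (urls, dois) found in text, deduplicated and sorted."""
    # Strip trailing punctuation that usually belongs to the sentence.
    urls = sorted({m.rstrip(".,;:") for m in URL_RE.findall(text)})
    dois = sorted({m.rstrip(".,;:)") for m in DOI_RE.findall(text)})
    return urls, dois


def match_known_dois(dois, known_software_dois):
    """Keep only DOIs that appear in a set of known software DOIs."""
    return [d for d in dois if d.lower() in known_software_dois]


# Made-up sample text standing in for extracted PDF text.
sample = (
    "We used the code at https://github.com/example/tool. "
    "Archived as doi:10.5281/zenodo.12345, see also (http://example.org/data)."
)
urls, dois = find_links(sample)
known = {"10.5281/zenodo.12345"}  # hypothetical list of software DOIs
hits = match_known_dois(dois, known)
print(urls)
print(dois)
print(hits)
```

One thing this glosses over is that PDF text extraction often breaks long URLs across lines, so a real pass would probably need some line-joining heuristics before matching.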