Exciting news for scientists, and for the future of shared information: Nature has reported that technologist Carl Malamud has released, online and for free, a huge index of words and short phrases contained in over 107 million journal articles, including many paywalled papers. According to Malamud, the index, which contains tables of over 355 billion words and short fragments listed next to the articles where they appear, is an attempt to “help scientists use software to glean insights from published work even if they have no legal access to the underlying papers.”
Researchers told Nature that Malamud’s index is a major development in letting them search existing scientific literature with software, a practice known as text mining. Researchers already text mine papers to build useful databases of information, but they are often restricted by lack of access to paywalled or private articles. This index will let them base their work on a larger set of scientific knowledge.
“I am very confident that what I’m doing is legal,” Malamud told Nature. “We are not doing this to provoke a lawsuit, we are doing it to advance science.” Malamud says that since his index only contains sentence snippets up to five words long, not articles’ full text, releasing the index doesn’t breach publishers’ copyright restrictions. However, a legal question may arise in regard to how Malamud obtained the papers to create the index; Malamud declined to say how he got copies of the 107 million articles.
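To make the idea of such an index concrete, here is a minimal sketch in Python of a table that lists words and fragments of up to five words next to the articles where they appear. It is only an illustration of the general technique (an inverted index over word n-grams), not Malamud's actual format or code; the function names, the sample article IDs, and the sample texts are all made up for this example.

```python
from collections import defaultdict

MAX_WORDS = 5  # the article says snippets run up to five words long


def ngrams(text, max_words=MAX_WORDS):
    """Yield every run of 1..max_words consecutive words in the text."""
    words = text.lower().split()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(articles):
    """Map each fragment to the set of article IDs that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for fragment in ngrams(text):
            index[fragment].add(article_id)
    return index


# Hypothetical articles, invented for illustration only.
articles = {
    "paper-A": "gene expression in yeast cells",
    "paper-B": "protein folding and gene expression",
}
index = build_index(articles)

print(sorted(index["gene expression"]))  # ['paper-A', 'paper-B']
print(sorted(index["yeast cells"]))      # ['paper-A']
```

A lookup returns only the IDs of the articles that contain a fragment, which is the sense in which an index like this can be searched without holding the full text of the papers.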