2017 Retrospective: FGV uses Artificial Intelligence to develop unique text mining system

FGV’s School of Social Sciences (CPDOC) and School of Applied Mathematics (EMAp) have created the History Lab tool in partnership with Columbia University.
27 December 2017

Using artificial intelligence, the researchers built new research tools and compiled an unprecedented amount of data on international relations involving countries such as Brazil, the United Kingdom and the United States.

“Virtual files are increasingly fragmented, with dozens of collections scattered across the web, each with different metadata and search features. We compiled multiple collections into a single database, which is the largest group of unclassified documents processed and available to anyone outside the government,” said Renato Rocha Souza, EMAp professor and researcher.

CPDOC professor Alexandre Moreli said that the robot created by the History Lab project can read digital or scanned files and carry out a probabilistic and statistical analysis, with no limit on the number of documents. According to Moreli, based on this analysis, the system identifies related works within the universe of documents and creates cohesive thematic groups, a task that could take a person months, years or even decades to accomplish.
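
As a rough illustration of how statistical analysis can group related documents, here is a minimal Python sketch. It is not the History Lab implementation, which the article does not describe in detail. It assumes a simple stand-in technique: TF-IDF word weighting, cosine similarity and a greedy grouping rule. The function names and the 0.2 threshold are hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Words found in every document get weight 0 (log(1) == 0).
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_documents(docs, threshold=0.2):
    """Greedy grouping: a document joins the first group that already
    contains a sufficiently similar document, else it starts a new one.
    The threshold is an illustrative assumption, not a tuned value."""
    vectors = tfidf_vectors(docs)
    groups = []  # each group is a list of document indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if any(cosine(vec, vectors[j]) >= threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

Given a list of document texts, `group_documents` returns lists of document indices, one list per thematic group. A production system would more likely rely on topic models or learned embeddings, but the principle of comparing documents through word statistics is the same.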

Using code and algorithms written by the group of FGV researchers, which includes information experts and historians, the tool applies artificial intelligence to identify and group related documents. The Brazilian arm of the project published its papers in May 2016, using more than 10,000 documents from CPDOC’s historical collection, including documents from Brazil’s former Minister of Foreign Relations, Antônio Azeredo da Silveira. The tool’s performance was impressive enough to draw the attention of the U.S. Department of State (DoS) and intelligence agencies.

According to Renato Rocha Souza, the collections processed by History Lab include the first born-digital DoS collection, which dates back to the 1970s and holds approximately one million documents. The professor said that most of this information is sensitive or classified, which is why it is controlled by the country’s information access laws. Souza added that the decision to make each document public depends on its level of confidentiality, and that assessing this is an extremely delicate process that has so far been carried out manually, which makes it very time-consuming and expensive. One of the History Lab’s goals is to facilitate this process by automatically analyzing the content of each filed document and classifying its level of confidentiality, which determines whether it can be released to the public.
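
The idea of automatically classifying a document's level of confidentiality can be sketched as a text classification task. The example below is an assumption-laden illustration, not the project's method: it uses a small multinomial Naive Bayes classifier with add-one smoothing, and the class name and the labels a caller passes in (for instance "restricted" and "releasable") are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.label_counts = Counter(labels)      # label -> number of docs
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocab)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The classifier is trained with `fit(texts, labels)` on documents whose confidentiality level is already known, and `predict(text)` then suggests a label for a new document. In practice such a suggestion would support human reviewers rather than replace them, given how delicate the release decision is.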

“In addition to this process regarding historical documents, they were also interested in our technology because of the recent data leaks to the Russian government and WikiLeaks. In 2015 alone, the DoS spent USD 16 billion to protect information. A good example is the document on the sensitivity of the Japanese government regarding U.S. inspections on nuclear facilities. This document wasn’t classified, but it should have originally been marked as confidential,” said Renato Rocha Souza.

Future – According to CPDOC professor Alexandre Moreli, the History Lab will soon expand the processed collections to include documents from other countries, in addition to making other tools available to researchers.

“FGV’s team will continue to develop the current features and search for more innovations, such as automatic face recognition in historical photographs or synchronizing audio and text from Oral History interviews. This expertise can be used both by archival institutions and public agencies and by companies or other stakeholders who want to monitor and explore sets of information at increasingly large scales,” he said.

Go to the website to learn more about the project.