Researchers are gathering South African news headlines to power AI
If an AI researcher wants to build a natural language processing model in English, there’s no shortage of data to train her algorithms.
With a click, she could have 1.8 million articles from the New York Times archives, carefully tagged by topic. She might throw in 800,000 stories from the Reuters archives, or 30 million words of text from the Wall Street Journal. Of course, she could also just use the state-of-the-art GPT-3 language model, which cut its teeth on more than 290 billion English words scraped from around the web.
But if she wants to build a model that will work for Setswana or Sepedi, two of South Africa’s 11 official languages, her best bet might be a nascent dataset of a few hundred headlines drawn from the Facebook page of the South African Broadcasting Corporation (SABC). The corpus is the work of researchers from seven South African universities, who aspire to build up their own version of the massive datasets that exist for US newspapers to power natural language processing (NLP) programs.
Read the rest of this story on qz.com.