Arctic Knot Conference 2021/Submissions/Sami language resources from Norwegian public sector internet domains

Sami language resources from Norwegian public sector internet domains
Andre Kåsen and Magnus Breder Birkenes
  • Pre-recorded video presentation (15–30 mins)
We present the results of a deep crawl of public content from Norwegian public entities. We downloaded 1.8 million unique web pages and text documents and extracted natural-language text from them using various text extraction methods (boilerplate removal, OCR). The resulting corpus consists of 4.3 billion words in various languages, including 3.4 billion in Norwegian (Bokmål and Nynorsk), 5.7 million in Northern Sami, 400,000 in Southern Sami and 200,000 in Lule Sami. In the presentation, we will take a closer look at the methods used and at the Sami resources found on the domains of Norwegian public entities. The project is a collaboration between the National Library of Norway and the Norwegian Digitalisation Agency.

Language technology
