Arctic Knot Conference 2021/Submissions/Sami language resources from Norwegian public sector internet domains

Submission no.
Title of the submission
Sami language resources from Norwegian public sector internet domains
Author of the submission
Andre Kåsen and Magnus Breder Birkenes
Submission format
  • Pre-recorded video presentation (15–30 mins)
Language of presentation
E-mail address
Country of origin
Affiliation, if any (organisation, company etc.)
Personal homepage or blog
Abstract (up to 300 words to describe your proposal)
We present the results of a deep crawl of public content from Norwegian public entities. We downloaded 1.8 million unique web pages and text documents and extracted natural-language text from them using several text extraction methods (boilerplate removal, OCR). The resulting corpus consists of 4.3 billion words in various languages, including 3.4 billion in Norwegian (Bokmål and Nynorsk), 5.7 million in Northern Sami, 400,000 in Southern Sami and 200,000 in Lule Sami. In the presentation, we will take a closer look at the methods used and at the Sami resources found on the domains of Norwegian public entities. The project is a cooperation between the National Library of Norway and the Norwegian Digitalisation Agency.

What will attendees take away from this session?
Theme of session
Language technology
Slides or further information (optional)
Special requests
Is this Submission a Draft or Final?

Interested attendees


If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with a hash and four tildes (# ~~~~).