Our Technology

To archive web content, you need a harvesting tool (a robot, or crawler). Its task is to navigate web pages, follow links, and download and save the content.

Veidemann

Several tools can accomplish this task. However, based on previous experience with other harvesters and technologies, and to ensure the quality of the material, the National Library has chosen to develop its own harvester.

The official name of the National Library’s harvesting tool is “Veidemann,” which in Norwegian means “hunter”. 

Veidemann uses a special version of the Chrome browser to render websites. If a website does not support this browser, we cannot guarantee that the harvested material will be as expected. 

The browser is controlled remotely by a robot that manages which websites to harvest, how often, how deeply, and various other parameters to prevent overloading the websites. 
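As a purely hypothetical sketch, a crawl configuration along those lines might look like the following. The parameter names are invented for illustration and are not Veidemann's actual configuration schema:

```python
# Illustrative only: these parameter names are hypothetical,
# not Veidemann's real configuration format.
crawl_job = {
    "seed": "https://example.org/",
    "schedule": "weekly",        # how often the site is revisited
    "max_depth": 5,              # how many links deep to follow
    "politeness_delay_s": 2.0,   # minimum pause between requests
    "max_parallel_requests": 1,  # avoid overloading the server
}
```

The politeness settings are the important part: a harvester that revisits too often, or fetches too many pages in parallel, behaves like unwanted traffic from the website owner's point of view.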

Once Veidemann has harvested a website, the content will be securely stored in the WARC file format on our servers.
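WARC (Web ARChive, standardised as ISO 28500) wraps each harvested HTTP exchange in a record with its own headers. As a rough illustration of that layout, here is a Python sketch that builds a minimal "response" record by hand; this is not Veidemann's actual writer, and real archives use dedicated WARC libraries:

```python
from datetime import datetime, timezone
from uuid import uuid4

def warc_response_record(uri: str, http_bytes: bytes) -> bytes:
    """Build a minimal WARC 1.1 'response' record (illustrative sketch)."""
    headers = "\r\n".join([
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Target-URI: {uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid4()}>",
        "Content-Type: application/http;msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ])
    # A record is: headers, a blank line, the payload, then two blank lines.
    return headers.encode() + b"\r\n\r\n" + http_bytes + b"\r\n\r\n"

record = warc_response_record("https://www.nb.no/",
                              b"HTTP/1.1 200 OK\r\n\r\nhello")
```

Because the captured HTTP response is stored verbatim inside the record, the archived page can later be replayed exactly as it was served.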

User-Agent

When a browser, harvester, or robot visits a website, it sends information about what kind of client it is and which platform it is running on. This is the browser's way of identifying itself, and websites can use it, among other things, to display customised pages for different browsers.

Veidemann declares the following User-Agent: "nlnbot/0.1 (+https://www.nb.no/nettarkivet)"
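As a sketch of the mechanism, the User-Agent is simply a request header that the client attaches to every request. Veidemann itself is not a Python program; this illustration just uses the Python standard library to show where the string ends up:

```python
from urllib.request import Request

# The User-Agent string Veidemann declares, sent as an ordinary header.
UA = "nlnbot/0.1 (+https://www.nb.no/nettarkivet)"
req = Request("https://www.nb.no/", headers={"User-Agent": UA})

# The website sees this header on every request and can identify the robot.
req.get_header("User-agent")
```

The trailing URL in the string is a common convention: it lets a website owner who notices the robot in their logs look up who is harvesting them and why.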

Robots.txt

Robots.txt is used by website owners to provide instructions on how they want harvesters and other robots on the internet to behave on their website. The instructions can specify how often harvesters are allowed to follow links, which parts of the website should be excluded, or whether the robot should be blocked entirely.

The National Library primarily adheres to the instructions set by website owners in robots.txt, but in exceptional cases these instructions may be overridden.
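Python's standard library includes a robots.txt parser, which can illustrate how such instructions are interpreted. The rules below are a made-up example, not any site's actual robots.txt:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt addressing our harvester by name.
ROBOTS_TXT = """\
User-agent: nlnbot
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Paths under /private/ are off limits; everything else is allowed,
# and the robot is asked to wait 10 seconds between requests.
rp.can_fetch("nlnbot", "https://example.org/private/page.html")
rp.can_fetch("nlnbot", "https://example.org/index.html")
rp.crawl_delay("nlnbot")
```

A well-behaved harvester checks these rules before every fetch and honours the requested delay between requests.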

Sitemap

The Norwegian Web Archive harvests what is publicly accessible on the internet. In other words, we harvest what the browser displays on a website. We do not harvest databases, and support for dynamic websites is limited. There must therefore be a link pointing to any resource we are expected to harvest. Major search engines (such as Google) have developed techniques that allow website owners to publish links to all their resources through a sitemap.

Our harvester will attempt to harvest every resource listed in such a sitemap. Alternatively, a link to the sitemap can be sent by email to nettarkivet@nb.no, and we can harvest it specifically. A sitemap does not guarantee that the website will be harvested, but it gives our harvester a hint so that it knows which resources exist.
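A sitemap is a small XML document listing URLs. The sketch below parses a minimal example with the Python standard library; the URLs are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal sitemap in the standard sitemaps.org format (example URLs).
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)

# Extract every listed URL so the harvester knows which resources exist.
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
```

This is exactly the kind of hint a sitemap provides: even pages that no other page links to become discoverable once they appear in the list.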