DH 2016 Workshop

Data Mining Digital Libraries (Digital Humanities 2016 - Kraków, Poland)

 Call for Papers: Workshop on Data Mining Digital Libraries

 Call Deadline: May 15th

 Workshop Date: July 12, 2016, 09:30-13:00

 Workshop Venue: Jagiellonian University and Pedagogical University of Kraków, Poland


Data Mining Digital Libraries

The central theme for this workshop is data mining and the connection between metadata and data in the context of digital libraries. Digital resources and search engines raise several questions about the relationship between metadata and the data they describe. For example, what is the relation between metadata keywords and classification categories? How do we label the topics found by topic modeling algorithms? With readily available search engine technology that ranks document relevance by content words, is there any need at all for library classification systems such as Dewey or UDC?

  • The structure of subject headings and descriptors, used in book classification (e.g. in building thesauri)
  • The relationship between topic words and library classification systems
  • The relationship between content words and topic words (of existing metadata, or as output from topic modeling algorithms)
  • Automatic classification of digital documents
  • Authorship attribution
  • Development of computational services for research and the general public
  • Legal issues arising with different data mining practices

Please send us an abstract of max. 500 words that is situated within the above context to sprakbanken@nb.no by May 15th.

All participants must register for the Digital Humanities 2016 conference.


The ongoing trend towards increased digitization in society poses numerous challenges at many levels, but also opens up vast opportunities within many fields, including the library sector.

At the National Library of Norway, a mass digitization project was initiated in 2006, with the goal of digitizing the entire collection of books, newspapers, movies, radio and television broadcasts, music, etc.: in sum, everything published in the public domain in Norway in all media types, i.e. the entire cultural heritage of Norway. For books, the goal is to have the entire stock digitized by 2017. Thus far, some 435,000 of 450,000-500,000 books have been digitized. When all books and newspapers have been digitized, we estimate that our Norwegian text corpus will consist of some 80-100 billion tokens, which is large for a rather small language like Norwegian, with approximately 5 million speakers. In comparison, the Google Books corpus contains approximately 500 billion tokens for English.

The National Library cooperates with scholars of literary studies and linguistics in developing and applying methods of data mining to the digital collection. We develop services that make the content available for quantitative research, without challenging intellectual property rights. One such service is NB N-gram for Norwegian (see http://www.nb.no/sp_tjenester/beta/ngram_1/), comparable to Google Ngram Viewer for English and other languages.

Workshop leaders

  • Lars G. Johnsen, research librarian, National Library of Norway
  • Arne Martinus Lindstad, research librarian, National Library of Norway
  • Magnus Breder Birkenes, research librarian, National Library of Norway

Program committee

  • Oddrun Ohren (National Library of Norway)
  • Koenraad De Smedt (University of Bergen)
  • Anders Nøklestad (University of Oslo)
  • Elise Conradi (National Library of Norway)
