A network for new and aspiring library professionals
This is the second in our series of Fact Sheets. The Fact Sheets are short posts that offer a summary and a selection of useful resources on a given topic.
The Fact Sheets will cover topics that are emerging trends in the library and information sector and/or topics that have already been covered by other library/information professionals. We hope the Fact Sheets will be a starting block for anyone who is interested in the topic.
We will try to maintain the resources section for each Fact Sheet as the topics evolve and when one of you becomes an expert in the topic, we will be begging you to present at one of our events!
Please add your own comments, suggestions and resources in the comments box.
Fact Sheet: Text mining
Data mining, text analysis and data analysis
“Text Mining (TM) and Natural Language Processing (NLP) provide algorithms and techniques for automated summarization and analysis of textual content, so that it is possible to extract and interpret the information contained in literature databases and repositories.” Nuzzo et al. (2010)
Scenario – You used a search engine to search for ‘NLPN’ and found 10 references. Reading them, you find that 4 mention the co-founders and their day jobs, 5 focus on NLPN itself (blogs, events etc.) and 1 mentions the National Literacy Programme in Namibia. By reading and analysing the text you are able to establish connections between the co-founders, NLPN and libraries. You are also able to recognise that, although the National Literacy Programme in Namibia might not be relevant, there is still a relationship, e.g. the same acronym and a shared interest in literacy/libraries. It’s an effective but time-consuming way of analysing text.
You might be thinking: why are you telling me how to analyse text? Well, humans are great at establishing links (no matter how tenuous) between two things. Machines, not so much – until now! A search engine simply matches the words or phrases you typed and pulls out every document that mentions them; text mining goes further, using natural language processing to categorise documents and highlight connections across the whole batch of analysed documents. By surfacing the connections within a data set, text mining can create new information in a much shorter time frame than a human could.
As vast amounts of information (of varying quality) are produced every day, text mining tools aim to reduce the time spent analysing each individual article by hand. Tools such as Carrot2 are designed to analyse text, group keywords together and generate visual maps, making it easier to identify the key themes relevant to your search.
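To make the idea concrete, here is a minimal sketch of keyword-based connection-finding in Python. Everything in it – the toy documents, the stopword list and the helper names – is invented for illustration, and real tools such as Carrot2 use far more sophisticated statistical clustering; this simply shows how shared terms can surface connections between a batch of search results.

```python
from collections import Counter

# Toy "search results" mirroring the NLPN scenario above
# (all text here is invented for illustration).
docs = {
    "blog post about NLPN events": "NLPN runs events for new library professionals",
    "co-founder profile": "the NLPN co-founders work as library professionals",
    "Namibia programme page": "the National Literacy Programme in Namibia promotes literacy",
}

# A tiny hand-picked stopword list; real systems use much larger ones.
STOPWORDS = {"the", "in", "for", "as", "a", "of", "and"}

def keywords(text, top_n=10):
    """Return the most frequent non-stopword terms in a document."""
    words = [w.lower() for w in text.split() if w.lower() not in STOPWORDS]
    return {w for w, _ in Counter(words).most_common(top_n)}

def shared_terms(doc_a, doc_b):
    """Terms two documents have in common – a crude 'connection'."""
    return keywords(doc_a) & keywords(doc_b)

# Highlight connections between every pair of results.
titles = list(docs)
for i, a in enumerate(titles):
    for b in titles[i + 1:]:
        common = shared_terms(docs[a], docs[b])
        if common:
            print(f"{a} <-> {b}: {sorted(common)}")
```

Run on the toy documents, this links the events post and the co-founder profile (they share terms like “NLPN” and “library”) while leaving the Namibia page unconnected – a simplified version of the categorising and connection-highlighting described above.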
Text mining has demonstrated a positive impact on research by enabling, for instance, data analysts to identify links that might otherwise have been missed. And as Ananiadou (2009) notes, text mining can analyse data across multiple sources whilst reducing human error, information overload and the risk of information being overlooked.
Because text mining analyses full-text documents, it strengthens the case for high-quality open access research that can be mined. Institutions also need to take text-mining software into consideration, particularly when educating staff on copyright issues or negotiating with their copyright licensing agency: mining articles that are not open access, covered by a subscription or paid for under a copyright fee may infringe the institution’s copyright licence agreement.
NaCTeM – the National Centre for Text Mining
Ananiadou, S. et al. (2009) Text Mining for Health and Care and Medicine
Nuzzo, A. et al. (2010) Text Mining approaches for automated literature knowledge extraction and representation. Medinfo. 160:954-8