Notes: Text + Data Mining Symposium 2017

On Wednesday 12th July 2017, a group of librarians, researchers and experts in the field gathered for the University of Cambridge’s first Text & Data Mining (TDM) Symposium at the Engineering Department in Cambridge.

I went into the symposium with the specific aim of gleaning information from a librarian’s point of view and of ascertaining the angles from which a library service built around TDM could operate. This included its implications for library users both at the small scale and University-wide, and its value, limitations and parameters.

So, what is TDM?

TDM allows a user, through digital techniques, to explore large collections of textual material, extract new datasets and, through analysis, find out new information about a topic.
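
As a toy illustration of my own (not an example from the day): even a few lines of Python can “mine” a tiny corpus by counting which terms characterise each document. Real TDM does this at vastly greater scale and sophistication, but the underlying idea – turning text into analysable data – is the same.

```python
# Toy text mining: count the most frequent terms in each document of
# a tiny corpus. Real TDM pipelines work over millions of articles,
# but the principle of extracting data from text is identical.
from collections import Counter
import re

corpus = [
    "Text mining extracts new datasets from large text collections.",
    "Data mining finds patterns in large structured datasets.",
]

for doc in corpus:
    words = re.findall(r"[a-z]+", doc.lower())
    print(Counter(words).most_common(3))
```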

What else did I learn?

Well, below are some of the photos I took at the conference and a summary of the notes I made.

Speaker: Kiera McNiece, FutureTDM

[Photo: 20170712_110624]

Current statistics show that scholarly articles are now published at a rate of around 2.5 million a year. Spread over the year’s roughly 31.5 million seconds, that’s one paper every 13 seconds or so – a heck of a lot of information to trawl through. This is where the concept of TDM comes in.

But, how does TDM identify the right subject?

Well, it needs machine-readable data.

Can a computer figure out how to download articles?

Well, this is the challenge: preparing content for analysis. Where a human sees an image or a laid-out page, a machine needs raw data and metadata to analyse it. It also needs bulk access to content (currently a publisher-dependent factor). Librarians need to consider these TDM needs when negotiating with publishers.
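
As a hedged sketch of what machine-readable, bulk-accessible content looks like in practice (my own example, using the open Crossref REST API rather than anything demonstrated on the day):

```python
# Sketch: fetching machine-readable article metadata in bulk from the
# open Crossref REST API. The query term is just an illustration.
import requests

resp = requests.get(
    "https://api.crossref.org/works",
    params={"query": "text and data mining", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["message"]["items"]:
    # Each record is structured metadata a machine can analyse
    # directly, unlike a PDF rendered for human eyes.
    print(item.get("DOI"), "-", item.get("title", ["(untitled)"])[0])
```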

One further challenge is the law. In some cases, such as mining the open web, licensing is impossible!

The GDPR (General Data Protection Regulation) covers personal data such as name, age, gender and IP address – sensitive data. Anyone combining datasets that contain these fields risks endangering anonymity.
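
To see why, consider this contrived sketch (all data invented): two datasets that are each “anonymous” on their own can be joined on shared quasi-identifiers to re-identify a person.

```python
# Contrived illustration: neither dataset alone names the patient,
# but joining them on shared quasi-identifiers (age + postcode
# district) re-identifies them. All data here is invented.
medical = [{"age": 34, "postcode": "CB2", "diagnosis": "asthma"}]
electoral = [{"age": 34, "postcode": "CB2", "name": "J. Bloggs"}]

for m in medical:
    for e in electoral:
        if (m["age"], m["postcode"]) == (e["age"], e["postcode"]):
            print(f"{e['name']} likely has {m['diagnosis']}")
```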

FutureTDM’s role is to:

  • Increase content availability by building datasets
  • Support early adopters by developing tools
  • Foster a data-savvy society

One suggestion for libraries was to spread TDM skills across departments, communicating TDM’s value and the support on offer.

Idea: See FutureTDM’s “awareness sheets” for examples of success stories in this area.

So, to tackle the scale and complexity of data, we need to consider F.A.I.R. content. That’s content that is Findable, Accessible, Interoperable and Reusable.

Speaker: Charles Matthews, Wikimedian, ContentMine, Cambridge

ContentMine offer a series of tools on their website and would like help disseminating them. You can find them here: http://contentmine.org/

Charles talked about the challenge ContentMine face in sourcing (human) volunteers to weed out bad data through good judgement, whilst keeping those volunteers interested.

He touched on issues of trust, recall and precision in the process, and asked just how successful data scraping really is.

It’s an interesting dilemma to consider but one that left me with more questions than answers.
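
For anyone unfamiliar with the recall and precision Charles mentioned, here is a minimal sketch with invented numbers:

```python
# Minimal sketch of precision and recall for a scraper's output.
# "relevant" is what a human judge would accept; "retrieved" is what
# the scraper actually returned. All values are invented.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc2", "doc3", "doc5"}

true_positives = relevant & retrieved
precision = len(true_positives) / len(retrieved)  # how much of what we got is good
recall = len(true_positives) / len(relevant)      # how much of the good stuff we got

print(f"precision={precision:.2f}, recall={recall:.2f}")  # 0.67 and 0.50
```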

Speaker: Alison O’Mara-Eves, University College London

[Photo: 20170712_124316]

Alison spoke on “managing the information deluge” and how text mining and machine learning are changing systematic review methods.

The old system of systematic review went as follows:

  1. Search
  2. Screen
  3. Extract information
  4. Analyse and draw conclusions

However, with Web of Science recently advertising its one billionth indexed item, it is clear that the old system is no longer an effective method.

What have they found by using Text Mining (TM)?

Well, they found that it can inform search strategy. An interesting side-effect!

They also found that it resulted in a 30-97% reduction in screening workload – a huge cost factor to consider. However, discrepancies in the analysis phase made them reluctant to remove the human element completely, so they adopted a semi-automated process.

Potentially, they could have used a machine as a “second-screener” but they warned that it should be used with care.

A fully-automated process was highly promising but the performance varied.

These tests were conducted by seeding a known quantity of right hits and ascertaining how long it took the machine to find them all.
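
As a hedged sketch of what such semi-automated screening might look like (my own, using scikit-learn and invented abstracts; the UCL team’s actual tooling will differ):

```python
# Sketch of machine-assisted citation screening: train on abstracts a
# human has already screened, then rank the unscreened ones so that
# reviewers read the most promising first. All titles are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

screened = [  # (abstract, human decision: 1 = include, 0 = exclude)
    ("exercise intervention trial in adolescents", 1),
    ("randomised trial of school exercise programmes", 1),
    ("survey of librarian attitudes to cataloguing", 0),
    ("case study of museum archive digitisation", 0),
]
unscreened = [
    "pilot trial of physical activity in schools",
    "history of nineteenth-century bookbinding",
]

texts, labels = zip(*screened)
vectoriser = TfidfVectorizer()
model = LogisticRegression().fit(vectoriser.fit_transform(texts), labels)

# Rank unscreened abstracts by predicted relevance, highest first.
scores = model.predict_proba(vectoriser.transform(unscreened))[:, 1]
for score, title in sorted(zip(scores, unscreened), reverse=True):
    print(f"{score:.2f}  {title}")
```

Evaluating a system like this, in the manner described above, amounts to checking how far down the ranking the known relevant items appear.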

Speaker: Georgina Cronin and Yvonne Nobis, Betty & Gordon Moore Library, Cambridge

[Photo: 20170712_141610]

So, TDM, in short, can create databases that can themselves be mined.
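
To make that concrete, a minimal sketch of my own (not from the talk): facts pulled out of text by mining can be stored in an ordinary database and then queried like any other dataset.

```python
# Sketch: facts extracted by text mining become a queryable database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE findings (paper TEXT, gene TEXT, disease TEXT)")
# Imagine these rows were extracted automatically from thousands of papers.
conn.executemany(
    "INSERT INTO findings VALUES (?, ?, ?)",
    [("doi:10.x/1", "BRCA1", "breast cancer"),
     ("doi:10.x/2", "BRCA1", "ovarian cancer")],
)
# The extracted dataset can now itself be mined with ordinary queries.
for (disease,) in conn.execute(
        "SELECT disease FROM findings WHERE gene = ?", ("BRCA1",)):
    print(disease)
```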

What about the workflow of employing TDM, and at what stages can libraries help? Simple answer: every stage!

[Photo: 20170712_140613]

  • Identify questions that need to be asked (via reference interviews)
  • Identify sources (libraries can help by directing users to services and content, and by advising on risk)
  • Access sources (connect people to resources)
  • Download resources (help via IT support)
  • Create a program to ask questions (find someone with skills or facilitate training sessions)
  • Analyse the results and interpret (connect with colleagues involved – don’t ask “how can I help?”, ask “what are you working on?”)

Advocacy is an issue to consider. The legal limitations are vague and are often enacted as restrictions imposed by the publishers. Do your research but be bold with the publishers – they might actively encourage TDM!

Example success story: the University of Manchester has agreed terms with Elsevier to provide access for TDM. The user simply registers the IP address of the device on which they do the work, which allows Elsevier to monitor activity.

Flagged “unusual activity” may be either benign or malign, yet the resulting block can affect all campus users. Do the publishers consider the impact of removing permissions?

The legal and technical aspects to consider place librarians in a very different position. There is a real danger of individuals using TDM without being informed, endangering all users. We have a duty, as librarians, to inform!

Speaker: John McNaught, National Centre for Text Mining

John’s excellent talk gave us a quick chance to recap.

He asked: what are the barriers to discovery?

[Photo: 20170712_143518]

And what is TM in a nutshell?


[Photo: 20170712_143705]

He then expanded on the semantic ambiguity revealed during text mining analysis.

One example he gave was the genuine newspaper headline about Michael Foot: “Foot heads arms body”. In other words, a piece of reportage on Michael Foot heading a nuclear-disarmament body! Completely irrelevant when conducting a TM search on parts of the body, of course, but a good example of the machine’s blind spots.

Boolean pre-searches might help at this point.
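
As a toy sketch of my own invention, here is a Boolean pre-filter that would correctly drop the Michael Foot headline from an anatomical search:

```python
# Toy Boolean pre-filter: keep only documents containing ALL required
# terms, e.g. "foot AND injury". The ambiguous Michael Foot headline
# lacks "injury" and so is filtered out before deeper analysis.
docs = [
    "Foot heads arms body",                     # Michael Foot, disarmament body
    "Stress fracture is a common foot injury",  # anatomical sense
]

def matches(doc: str, required: set[str]) -> bool:
    words = set(doc.lower().split())
    return required <= words  # every required term must appear

print([d for d in docs if matches(d, {"foot", "injury"})])
```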

I noticed the similarities between this ambiguity problem and the UX issues that our own readers have when searching via iDiscover (another project I’m currently involved in).

John suggested solutions to the process before dazzling us with a scary sequence of slides.

[Photo: 20170712_144935]

“Much knowledge lies unsuspected” in archival information, he prompted.

[Photo: 20170712_145945]

He talked about NaCTeM – nactem.ac.uk – the first publicly-funded text mining centre in the world, where he works, and their RobotAnalyst, which aims to build upon current text mining technologies. One of its aims is to create a semi-automatic citation screening process.

[Photo: 20170712_145954]

One counter-intuitive phrase he used that stuck in my mind was “TM is not monolithic”.

[Photo: 20170712_150017]

He pointed out the gaps and issues of text mining in its current form, and suggested possible next steps for libraries.

[Photo: 20170712_150140]

He was keen to point out that if a library/institution decides to build a TDM annotation database, experience has taught NaCTeM to beware the pitfalls of big data.

[Photos: 20170712_150200, 20170712_150438, 20170712_150512]

Round Table Discussion

As a final summary, a round table discussion asked just what the future challenges facing TDM are.

Salient points that I took away were as follows:

  • Snippets (“snips”) can be shown, but otherwise legal interoperability issues must be addressed.
  • The country must speak with a single voice concerning TDM. With journal subscriptions currently costing £10-20 billion a year, policy is vital.
  • Problems when dealing with publishers must be addressed.
  • Algorithms would work better with correct and greater numbers of annotated datasets.
  • Use and misuse of tech systems that measure “benign” and “malign” use. Restrictive tech must be phased out. Publishers need to be more open.
  • Problems attracting and retaining qualified staff and problems engaging staff.
  • Restrictiveness. Uncertainty. Awareness.
  • Innovation. There are currently technical measures in place to prevent TDM access.
  • Systematic reviewers struggle with the value of TDM. They worry about losing results data.
  • Funders are pro-TDM so there is no challenge to overcome on that front!
  • Advocacy. Institutions are currently being cut off by publisher safeguards.
  • Machine-readable data needs to be in one place.
  • Worry: Will publishers ask for more money if we are getting more use via TDM methodology?
  • Funding is vital and should be available to all.

An interesting take-away was one attending publisher’s concern over TDM use: they wanted researchers to contact them prior to any TDM work. Surely research is more fluid than these expectations allow? Providing such foreknowledge is often an impossibility, as one researcher pointed out. The publisher’s naivety concerning research methodology was, in this instance, quite staggering, and one hopes this standpoint isn’t repeated across the board.

So, quite a heated finish, but a thoroughly interesting event. Exciting times, then, for proactively building library-researcher relationships on a whole new playing field.

Please note: I hope I’ve managed to correctly nail the statistics, terminology and intention of each speaker and that nothing has got lost in translation. Apologies if I have got the wrong end of the stick. The source material is either online already or being compiled – see below.

Further reading:

There was live tweeting throughout the day from several individual accounts – see the gathered information at #osctdm

The Office of Scholarly Communication (University of Cambridge) have been gathering materials from all the speakers’ talks.

The recordings they made on the day are here: https://www.youtube.com/playlist?list=PLG24w6ETyHS3fYbDnB6LOOzOfATVhP3zp

Their own “Next Steps” report can be found here: https://unlockingresearch.blog.lib.cam.ac.uk/?p=1505

Speaker Georgina Cronin has published an excellent blog post on her talk (prepared by both Georgina and her colleague Yvonne Nobis) here.

One of my fellow librarian bloggers also attended and came to these conclusions: https://thelibrarianerrant.wordpress.com/2017/07/13/text-and-data-mining-symposium/

Laurence Horton made live-notes on the day. They can be found here: https://docs.google.com/document/d/1-ik2G9Ix_y4SydNUTA47FPSjrseFfkaS_peZratuZS4/edit

For those wanting to recap on the excellent TDM talk given at the RLUK Conference 2017, the slides are here: “Developing a research Library position statement on Text and Data Mining in the UK”.

