I am trying to understand automatic entity recognition (AER) and its connections to libraries and to metadata.
The basic concept is simple, I think: a computer processes a document or other text, “recognizes” all the entities mentioned in it, and marks them up semantically.
Entities are anything that can be named as a noun: mayonnaise, radio antennas, Kentucky, the Super Bowl, epistemology, etc.
Recognition, in this context, means to match the entities named in the text to a standard representation or surrogate for that term, such as an authority record or a linked data uniform resource identifier (URI).
I know that my smartphone recognizes telephone numbers in emails that I read on the device, and I guess this is an example of automatic entity recognition: the phone numbers are indeed entities, and the machine recognizes them as such, automatically, and lets me place a call simply by tapping the link it creates.
Of course, not all ten-digit numbers are telephone numbers, so my smartphone will occasionally mark up a non-telephone number as a telephone number, but to be honest, this error almost never happens.
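To make the phone number case concrete, here is a minimal sketch of how such recognition might work, assuming nothing more than a pattern match for ten-digit numbers. The pattern and the name find_phone_candidates are my own illustrative assumptions, not how any particular phone does it. The point is only that a ten-digit order number passes the same test as a real phone number.

```python
import re

# Hypothetical pattern: ten digits, optionally grouped 3-3-4 with spaces,
# dots, or hyphens, and optional parentheses around the first group.
PHONE_PATTERN = re.compile(r"(?<!\d)\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)")

def find_phone_candidates(text):
    """Return every substring that looks like a ten-digit phone number."""
    return [m.group(0) for m in PHONE_PATTERN.finditer(text)]

message = "Call me at (303) 555-0142. Your order number is 4820019356."
candidates = find_phone_candidates(message)
# Both the phone number and the order number are returned as candidates.
```

A pattern like this has no idea what the digits mean, which is why the occasional false match is unavoidable without more context.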
The future of automatic entity recognition is tied to the wide application of linked data, and specifically library linked data, as libraries (especially digital libraries) are repositories of machine-readable text and documents.
The goal of AER, I think, is to process objects stored in digital libraries by marking up the words that represent entities within them using linked data structures. For example, if the word mayonnaise occurs in text, the system will mark it up, say, with the URI for that entity in DBpedia: http://dbpedia.org/page/Mayonnaise (or with an identifier from some other semantic web data set or ontology).
In marked-up text, it might appear like this:
Aioli is, like mayonnaise <http://dbpedia.org/page/Mayonnaise>, an emulsion or a suspension of small globules of oil and oil soluble compounds in water and water soluble compounds.[1]
… except that all the other entities in the passage would also be marked up with their own linked data URI.
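The markup step described above can be sketched roughly as follows, assuming the simplest possible approach: a lookup table (a gazetteer) from terms to URIs. The Mayonnaise URI is the one given above; the other two entries, the table itself, and the function name mark_up are assumptions for illustration only.

```python
import re

# Hypothetical gazetteer: surface term -> linked data URI.
# Only the Mayonnaise entry comes from the text; the others are assumed.
GAZETTEER = {
    "mayonnaise": "http://dbpedia.org/page/Mayonnaise",
    "aioli": "http://dbpedia.org/page/Aioli",
    "emulsion": "http://dbpedia.org/page/Emulsion",
}

def mark_up(text, gazetteer):
    """Wrap each known term in an HTML-style link to its URI."""
    # Longest terms first, so multi-word entries win over their parts.
    terms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )

    def link(match):
        uri = gazetteer[match.group(0).lower()]
        return '<a href="{}">{}</a>'.format(uri, match.group(0))

    return pattern.sub(link, text)

sentence = "Aioli is, like mayonnaise, an emulsion or a suspension."
marked = mark_up(sentence, GAZETTEER)
```

A table lookup like this is easy to build, but it matches strings rather than meanings, which is where the two problems below come in.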
I see two problems, or perhaps hurdles, with this approach. The first is that linked data is also supposed to express relationships among things, and simple entity recognition doesn’t accomplish that. For example, what is the relationship between the entity mayonnaise and the text above? Is it the subject of the piece? No. Entity recognition alone doesn’t help with context or relationships among entities.
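A small sketch may make the gap clearer. Linked data statements are usually expressed as subject, predicate, object triples, while recognition alone produces only a set of identifiers. The Aioli URI and the predicate URI here are hypothetical, invented purely for illustration.

```python
# What entity recognition alone yields: identifiers, with no links among them.
recognized = {
    "http://dbpedia.org/page/Aioli",
    "http://dbpedia.org/page/Mayonnaise",
}

# What a linked data statement needs: subject, predicate, object.
# The predicate below is hypothetical. Nothing in the recognition step
# supplies it, so a person or a separate process would have to.
triple = (
    "http://dbpedia.org/page/Aioli",
    "http://example.org/vocab/isComparedTo",
    "http://dbpedia.org/page/Mayonnaise",
)

subject, predicate, obj = triple
predicate_was_recognized = predicate in recognized  # False
```

The subject and object fall out of recognition; the predicate, which carries the actual relationship, does not.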
The second problem is what in my research I have called the homonym problem.[2] Artificial intelligence hasn’t advanced enough to find and identify entities in text successfully and unambiguously. For example, mayonnaise can be a condiment, and it can also be a Filipino alternative rock/pop-punk band, according to Wikipedia.
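Here is a rough sketch of what the homonym problem looks like to a program, assuming a naive approach that picks the candidate whose cue words overlap most with the surrounding text. The candidate table, the cue words, the band URI, and the function name disambiguate are all assumptions of mine, not a description of any real system.

```python
# Hypothetical candidate table: one surface term, several possible entities.
CANDIDATES = {
    "mayonnaise": [
        {"uri": "http://dbpedia.org/page/Mayonnaise",
         "cues": {"oil", "egg", "emulsion", "sauce"}},
        {"uri": "http://dbpedia.org/page/Mayonnaise_(band)",
         "cues": {"band", "album", "song", "rock"}},
    ],
}

def disambiguate(term, context, candidates):
    """Pick the candidate whose cue words overlap most with the context.

    Returns None when no candidate overlaps at all or the top score is tied.
    """
    words = set(context.lower().split())
    scored = sorted(
        ((len(c["cues"] & words), c["uri"]) for c in candidates.get(term, [])),
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]

clear = disambiguate("mayonnaise", "an emulsion of oil in water", CANDIDATES)
unclear = disambiguate("mayonnaise", "I like mayonnaise a lot", CANDIDATES)
```

With helpful context the condiment wins; with a sentence like the second one, the program has nothing to go on and cannot choose.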
But there are better examples. Computers still cannot successfully disambiguate homonyms like the word abstract in text. It can mean a summary (an article abstract) or describe something vague, like an abstract concept. Thousands of other homonyms occur regularly in text, and systems won’t be able to identify them correctly all the time. Even the best search engines today can’t always perform this task correctly; for example, Google Scholar has listed authors such as “N. Vietnam” and “V. Conclusion.”
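I don’t know how Google Scholar actually extracts author names, but a sketch of the kind of heuristic that would produce these errors is easy to write. The pattern below (an initial, a period, a capitalized word) is purely my assumption for illustration.

```python
import re

# Hypothetical heuristic: an initial, a period, then a capitalized word.
AUTHOR_PATTERN = re.compile(r"\b[A-Z]\. [A-Z][a-z]+\b")

lines = [
    "J. Beall",       # an actual author name
    "V. Conclusion",  # presumably section five of a paper
    "N. Vietnam",     # presumably an abbreviation for North Vietnam
]

# All three lines satisfy the pattern, so all three look like authors.
matches = [line for line in lines if AUTHOR_PATTERN.fullmatch(line)]
```

The shape of the string is identical in all three cases; only knowledge of what the words mean separates the author from the section heading.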
In theory, automatic entity recognition is a great idea, and work that leads to a successful implementation of this idea is worthy. In practice, artificial intelligence has a long way to go before it works well, especially across all domains of knowledge and across all languages.
[1]. http://en.wikipedia.org/wiki/Aioli
[2]. Beall, Jeffrey. (2008). “The weaknesses of full-text searching.” The Journal of Academic Librarianship, 34.5: 438-444.