September 19, 2011

Article: Class with Fred Kilgour

Twenty years ago I had an article published in the OCLC Newsletter entitled “Class with Fred Kilgour.”  Here’s the formal citation:

Beall, Jeffrey. (1991). “Class with Fred Kilgour.” OCLC Newsletter 190, March/April: 13-14.

Now I am researching an article about Fred Kilgour, and I unearthed the original print version of the article. It spans three pages, so I scanned it and saved it as three separate .JPG files, which I include below.

My article has a lot of stuff that I had forgotten about, so my work of twenty years ago will be helpful as I write the new article.

Click on any section to make it bigger:   

Posted via email from Metadata | Comment »

September 15, 2011

Just published: Abbreviations, Full Spellings, and Searchers’ Preferences


My article “Abbreviations, Full Spellings, and Searchers’ Preferences” has just been published in the journal Cataloging & Classification Quarterly.

Here is the abstract:

This study examined ten, selected word pairs, each containing a word’s full spelling and its abbreviation, to determine which form search engine users preferred in searching. Using seven search logs gathered from several Internet search engines with approximately 608 MB of data, the study measured the occurrences of the twenty terms. The selected words are important in library cataloging, for some are prescribed abbreviations in metadata content standards. The study found that in eight of the ten word pairs users preferred to search words’ full spellings over the abbreviations, often by a high margin.

The article is available online here:  http://www.tandfonline.com/doi/abs/10.1080/01639374.2011.595886

 


September 4, 2011

What is Automatic Entity Recognition?


I am trying to understand automatic entity recognition (AER) and its connections to libraries and to metadata.

The basic concept is simple, I think: A computer processes documents or text and “recognizes” all the entities that are mentioned in it and marks them up semantically.

Entities are anything that can be named as a noun: mayonnaise, radio antennas, Kentucky, the Super Bowl, epistemology, etc.

Recognition, in this context, means to match the entities named in the text to a standard representation or surrogate for that term, such as an authority record or a linked data uniform resource identifier (URI).

I know that my smart phone is able to recognize telephone numbers in emails that I read on the device, and I guess this is an example of automatic entity recognition: the phone numbers are indeed entities, and the machine recognizes them as such — automatically — and allows me to make a phone call simply by clicking on the hotlink it creates when it recognizes a phone number.

Of course, not all ten-digit numbers are telephone numbers, so my smart phone will occasionally mark up non-telephone numbers as telephone numbers, but to be honest, this error almost never happens.
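Here is a minimal sketch of how this kind of recognition might work, assuming a simple pattern match on ten-digit North American numbers. The pattern, the function name, and the sample message are my own illustrations, not what any particular phone actually does. Note that the second number in the sample is an order number, yet it gets linked anyway, which is exactly the kind of error described above.

```python
import re

# Assumed pattern: ten digits, optionally grouped as (303) 555-0142,
# 303-555-0142, 303.555.0142, or 3035550142. Real devices use richer rules.
PHONE_PATTERN = re.compile(
    r"(?<!\d)\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def link_phone_numbers(text: str) -> str:
    """Wrap every candidate phone number in a tel: hyperlink."""

    def to_link(match: re.Match) -> str:
        digits = "".join(match.groups())
        return f'<a href="tel:{digits}">{match.group(0)}</a>'

    return PHONE_PATTERN.sub(to_link, text)


# Hypothetical message: the first number is a phone number, the second is not.
message = "Call me at 303-555-0142 about order 1234567890."
print(link_phone_numbers(message))
```

The lookarounds at each end keep the pattern from matching ten digits inside a longer run of digits, which removes some false positives but not all of them.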

The future of automatic entity recognition is tied to the wide application of linked data, and specifically library linked data, as libraries (especially digital libraries) are repositories of machine-readable text and documents.

The goal of AER, I think, is to process objects stored in digital libraries by marking up the words that represent entities within them using linked data structures. For example, if the word mayonnaise occurs in text, the system will mark it up, say, with the string for that entity in DBpedia:  http://dbpedia.org/page/Mayonnaise (or some other semantic web ontology).

In marked up text, it might appear like this:

Aioli is, like mayonnaise, an emulsion or a suspension of small globules of oil and oil soluble compounds in water and water soluble compounds.[1]

… except that all the other entities in the passage would also be marked up with their own linked data URI.  
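The markup step can be sketched in a few lines, assuming the simplest possible approach: a lookup table that maps names to URIs. The table, the function name, and the URI for aioli are hypothetical; only the mayonnaise URI comes from the example above. A real system would draw on an authority file or a full linked data set rather than a hand-built dictionary.

```python
import re

# Hypothetical lookup table: lowercase entity name -> linked data URI.
ENTITY_URIS = {
    "mayonnaise": "http://dbpedia.org/page/Mayonnaise",
    "aioli": "http://dbpedia.org/page/Aioli",  # assumed for illustration
}


def mark_up_entities(text: str, entity_uris: dict[str, str]) -> str:
    """Wrap each known entity name in a hyperlink to its URI."""
    if not entity_uris:
        return text
    # Try longer names first so that a long name is not split by a shorter one.
    names = sorted(entity_uris, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )

    def to_link(match: re.Match) -> str:
        uri = entity_uris[match.group(1).lower()]
        return f'<a href="{uri}">{match.group(1)}</a>'

    return pattern.sub(to_link, text)


passage = "Aioli is, like mayonnaise, an emulsion."
print(mark_up_entities(passage, ENTITY_URIS))
```

This sketch only matches strings. It has no idea which sense of a word is meant or how the entities relate to one another.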

I see two problems, or perhaps hurdles, with this approach. The first is that linked data is also supposed to express relationships among things, and simple entity recognition doesn’t accomplish that. For example, what is the relationship of the entity mayonnaise to the text above? Is it the subject of the piece? No. Entity recognition alone doesn’t help with context or relationships among entities.

The second problem is what in my research I have called the homonym problem.[2] Artificial intelligence hasn’t advanced enough to find and identify entities in text successfully and unambiguously. For example, mayonnaise can be a condiment, and it can also be a Filipino alternative rock/pop-punk band, according to Wikipedia.

But there are better examples. Computers still cannot successfully disambiguate homonyms like the word abstract in text. It can be a summary (an article abstract) or something vague, like an abstract concept. Thousands of other homonyms occur regularly in text, and systems won’t be able to identify them correctly all the time. Even the best search engines today can’t always get this task correct; for example, Google Scholar has listed authors such as “N. Vietnam” and “V. Conclusion.”     
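To make the homonym problem concrete, here is a small sketch, assuming a table in which one word can point to more than one entity. The table is hypothetical and the example.org URIs are placeholders, not real identifiers. The point is only that a string match alone cannot choose among the candidates.

```python
import re

# Hypothetical candidate table: one surface form may map to several entities.
# The example.org URIs are placeholders invented for this illustration.
CANDIDATE_URIS = {
    "mayonnaise": [
        "http://dbpedia.org/page/Mayonnaise",
        "http://example.org/entity/mayonnaise-band",
    ],
    "abstract": [
        "http://example.org/entity/abstract-summary",
        "http://example.org/entity/abstract-concept",
    ],
    "kentucky": ["http://example.org/entity/kentucky"],
}


def resolve(term: str):
    """Return a URI only when exactly one candidate exists, else None."""
    candidates = CANDIDATE_URIS.get(term.lower(), [])
    return candidates[0] if len(candidates) == 1 else None


def ambiguous_terms(text: str) -> list[str]:
    """List the words in the text that have more than one candidate entity."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(w for w in words if len(CANDIDATE_URIS.get(w, [])) > 1)


sentence = "The abstract mentions mayonnaise and Kentucky."
print(ambiguous_terms(sentence))  # ['abstract', 'mayonnaise']
print(resolve("abstract"))        # None: the match alone cannot pick a sense
```

Choosing the right candidate requires context, and that is the part that remains hard.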

In theory, automatic entity recognition is a great idea, and work that leads to a successful implementation of this idea is worthy. In practice, artificial intelligence has a long way to go before it works well, especially across all domains of knowledge and across all languages.

[1]. http://en.wikipedia.org/wiki/Aioli

[2]. Beall, Jeffrey. (2008). “The weaknesses of full-text searching.” The Journal of Academic Librarianship, 34.5: 438-444.


August 31, 2011

Librarians and the Threat to Free Political Speech (Opinion piece)

I know this doesn’t have much to do with metadata, but I wanted to announce that an open-access, online pre-print of my upcoming “On My Mind” opinion column is now available on the ALA web site.

Here is the link: http://americanlibrariesmagazine.org/columns/my-mind/librarians-and-threat-free-political-speech


August 16, 2011

Article Review: “Birger Hjørland’s Manichean misconstruction of Marcia Bates’ work” by Marcia J. Bates


Manichean means “Pertaining to a strongly dualistic worldview,” and in this article Bates complains that Birger Hjørland misinterprets her work by saying he is right and she is wrong.

 

After a long and desultory explanation, Bates then concludes that she is right and Hjørland is wrong.

 

I have never seen a scholarly article quite like this. It is a defensive piece that attempts to save the reputation of a woman at the end of a long career in information science and psychology.

 

Bates patronizes her critic: “Hjørland has a gift for making provocative statements … ” She also makes excuses about her own work: “age and health issues will probably prevent me from completing that larger project on information”. 

 

I have found Hjørland to be among the most interesting writers in library and information science alive today. For example, he recently wrote an article with a fresh and unique perspective on evidence-based librarianship, a perspective that dared to take a politically incorrect stance and analyze EBL in a new light.

 

I hadn’t ever heard of Marcia Bates before but have read several of Hjørland’s articles. I look forward to hearing more from Hjørland in the future.

 

 

Endnotes:

1. The pre-print is here: http://dx.doi.org/10.1002/asi.21594

2. The quotations lack page numbers because at the time of this writing the article is in pre-publication.

 

 

 

 

 



July 25, 2011

A Libertarian Perspective on Scholarly Open Access Publishing

1. There is nothing wrong with publishing scholarly works for profit. Corporate bodies, including companies, institutions, and associations, are free to acquire articles, datasets, monographs, and other scholarship and to make the works available for a fee.

 

2. An individual is free to conduct his own research and keep it private, publish it openly on the internet, or license or sell it to a publisher or to any other organization. The same is true for groups of individuals, who may either conduct their own research or sponsor others’ research. Any attempt to limit this freedom is a violation of human rights.

 

3. It is wrong and unethical to compel an individual to make his intellectual property open access when he has not entered into an agreement to do so. The same is true for groups of individuals and corporate bodies.

 

4. Governments should generally not sponsor research. Instead, the private sector should sponsor and conduct research. A government may, however, sponsor research of its own that supports national security, criminal justice administration, and similar functions, and that it keeps private.

 

5. Governments have a necessary role in protecting the intellectual property rights of the owners of research and scholarly content, and this role should be exercised rigorously. 

 

6. Access to scholarship, even medical research, is not a civil right. All individuals or associations of individuals have the right to own intellectual property and to do with it as they please.

 

7. The high cost of scholarship does not justify the denial of intellectual property rights or the freedom to sell or license intellectual property. Limiting others’ intellectual property rights is always wrong because it constitutes a taking away of one’s freedom.

 

8. An unrestricted and non-subsidized marketplace for the creation and publication of intellectual property is the best and most productive way to generate and disseminate scholarly research for the benefit of all.

 

—Jeffrey Beall


July 16, 2011

Two Positive Things about Library Linked Data. Part 1: Precision and Recall

One of the things I find positive and appealing about library linked data is the potential for the almost-perfect precision and recall that it promises.

The advent of Google and other full-text search engines has caused information scientists to mostly ignore these two formerly important measures of success in information retrieval.

Today, relevance, however undefined it is and however it differs from search engine to search engine, is the standard. Things that appear near the top of your search results are relevant; the stuff on the second page is not.

But we all know that the stuff on subsequent search results pages can indeed be relevant and valuable.

Library linked data will bring back the important measures of recall and precision.

Precision relates to the homonym problem and is a measure of the proportion of relevant results retrieved in a search to the total number of results in the search.

For example, if you search for “springs,” you will retrieve hits about springs (water coming from the ground) and springs (mechanisms that are round, stretch, and bounce back). You probably only want one of these, but the search engine delivers them all, lowering the precision of your search results.

Recall relates to the synonym problem and is a measure of the proportion of relevant results retrieved in a search to the total number of relevant items in the database being searched.

For example, if you search for “leprosy,” you will retrieve documents that contain that term, but you will miss the ones that only use the term “Hansen’s disease.”
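Both measures are simple ratios, and a short sketch shows how they are computed. The document identifiers and the sets below are hypothetical, made up only to show the arithmetic.

```python
def precision(retrieved: set, relevant: set) -> float:
    """Share of the retrieved results that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved: set, relevant: set) -> float:
    """Share of the relevant items in the database that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical search: four hits, two of them relevant, and one relevant
# document (d5) that the search missed.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}

print(precision(retrieved, relevant))         # 0.5
print(round(recall(retrieved, relevant), 2))  # 0.67
```

In this made-up search, the two irrelevant hits lower precision in the way the homonym "springs" does, and the missed document lowers recall in the way "Hansen’s disease" does.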

These problems occur because human language is at once ambiguous and rich. A single word can mean different things and there can be many different ways to say one thing.

Because it substitutes computer language (uniform resource identifiers) for human language, library linked data has the potential to completely eliminate these two problems and to generate search results with perfect precision and recall, an ideal of library science since before card catalogs even existed.
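Here is a rough sketch of the difference, assuming a tiny collection in which every document has already been tagged, correctly, with a URI for the concept it is about. The documents, the example.org URIs, and the function names are all hypothetical, and the assumption that the tagging has already been done correctly is a large one.

```python
# Hypothetical concept URIs (placeholders, not real identifiers).
LEPROSY = "http://example.org/concept/leprosy"
SPRING_WATER = "http://example.org/concept/spring-hydrology"
SPRING_DEVICE = "http://example.org/concept/spring-device"

# Hypothetical collection: each document carries its text and concept URIs.
DOCUMENTS = {
    "doc1": {"text": "A history of leprosy treatment", "concepts": {LEPROSY}},
    "doc2": {"text": "Hansen's disease in the twentieth century", "concepts": {LEPROSY}},
    "doc3": {"text": "Hot springs of Colorado", "concepts": {SPRING_WATER}},
    "doc4": {"text": "Coil springs in suspension design", "concepts": {SPRING_DEVICE}},
}


def search_by_keyword(keyword: str) -> set:
    """Full-text style search: match the keyword against the document text."""
    return {
        doc_id
        for doc_id, doc in DOCUMENTS.items()
        if keyword.lower() in doc["text"].lower()
    }


def search_by_uri(uri: str) -> set:
    """Linked data style search: match on the concept identifier."""
    return {doc_id for doc_id, doc in DOCUMENTS.items() if uri in doc["concepts"]}


print(sorted(search_by_keyword("leprosy")))  # ['doc1']
print(sorted(search_by_uri(LEPROSY)))        # ['doc1', 'doc2']
print(sorted(search_by_keyword("springs")))  # ['doc3', 'doc4']
print(sorted(search_by_uri(SPRING_WATER)))   # ['doc3']
```

The keyword search misses the synonym and mixes the homonyms. The URI search does neither, but only because the documents were assumed to be tagged correctly in the first place.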

Any system that has this potential is certainly worthy of further study and open prototypes. I would love to see a prototype set up in a library information retrieval context.


July 1, 2011

Review of Eric Hellman’s Talk at ALA Annual 2011

At ALA Annual in New Orleans in June, 2011, I attended a LITA-sponsored presentation entitled “Linked In: Library Data and the Semantic Web,” with speakers Eric Hellman and Ross Singer. This is a critical review of the Hellman segment of the meeting.

Hellman’s talk was among the most arrogant and flippant I had ever attended at an ALA conference. His talk was supposed to be about linked data, but he exploited his position as speaker to unwarrantedly trash libraries, library standards, and librarians. 

For example, he merely talked about libraries in terms of spaces, people, and community. In the first half of his talk, he played the sociologist and tried to explain why libraries were so behind and how library standards and practices are so out of touch with the rest of the developed world.

Referring to metadata records, he insisted, “We don’t need surrogates.” He boasted that full-text searching is sufficient for all information seeking and retrieval needs. But when asked by an attendee about things in other languages and things like images that lack language content, he just blew her off and stated, “We have to manage abundance,” in a way that made it seem like he had heard the phrase elsewhere and was just parroting it back as his defense against the very valid question.

One of his slides stated, “The #1 purpose of library data in the digital information age is SEO.” This was a revealing statement because it shows how little he actually knows about library data and the functions of libraries. He ignores information organization, mediation, and preservation, still vital functions of libraries. He appeared completely ignorant of the weaknesses of full-text searching.

Hellman praised what he called microdata and schema.org, even though earlier in the talk he’d flippantly dismissed all metadata in favor of full text searching. He gushed over these two inchoate technologies, revealing the typical techie weakness of liking things just because they are new and rejecting things just because they are old.

Moreover, by promoting these metadata standards, he completely contradicted his earlier statements that full-text searching was sufficient for all information finding. 

There is a logical fallacy called “appeal to the people” that Hellman tried to use to convince his audience that libraries are all wrong on metadata. The fallacy is described like this:

"If you suggest too strongly that someone’s claim or argument is correct simply because it’s what most everyone believes, then you’ve committed the fallacy of appeal to the people."

This is the approach Hellman used to the extreme. And not only did he employ the logical fallacy, he also used disrespect, derision, and sarcasm to make his points. For instance, at one point, seeking laughs, he said that after nuclear Armageddon only cockroaches and MARC records would be left.

When OPACs began to replace card catalogs, there were no fools giving poisonous talks about the older technology. Over time, libraries recognized the newer and better technology and migrated to it. Hellman would insult people for taking trains instead of going by air. If he were truly confident in the new technologies he believes in so strongly, then he would not reveal his insecurities by mocking the earlier ones.

Epilogue: In contrast, the talk by Ross Singer was excellent. He gave an upbeat and positive presentation about the benefits of linked data and the Semantic Web and how libraries might effectively use them. Suzanne Graham did an excellent job organizing and moderating the talks. 


  Eric Hellman.

June 30, 2011

What I learned at ALA

2011 Annual ed.

Linked data: The many library consultants speaking at the conference believe that we are fools for continuing to use the MARC format. They did lots of evangelical work about library linked data, saturating us with quotes from Sir Tim and old slides with circles and connecting lines.

Discussion questions: If linked data is so great, why do you have to sell it to us? Why are the freelance consultants promoting linked data so strongly? What do they have to gain from it? Why do we not see wide adoption of this perfect solution in other domains?

A healthy skepticism is not a weakness. Librarians are critical thinkers and have learned not to adopt new technologies just because they are new.

Harrah’s Casino: It is not generally a good idea to go to a casino during ALA, except perhaps for the breakfast buffet or as an air-conditioned shortcut between two places. I learned this.

Eric Miller: The president of Zepheira was the speaker at the PCC Participants’ meeting. Sounding like a Baptist preacher, he tried to save the audience from their MARC sins and to convert the audience to linked data. Miller spoke as a general in the linked-data, full-frontal assault that the Semantic Web groupies carried out at Annual.

Roger C. Schonfeld, director of research at Ithaka, again gave the results of his latest, extremely-low-response-rate survey of college faculty and again concluded that universities and colleges really don’t need academic libraries anymore because Ithaka can supply everything colleges and universities need in the way of information.

RDA: Many had prepared extensive presentations about RDA, only to learn a week before the conference that the “big three” libraries will implement it no sooner than 18 months from now, if at all, which took the wind out of the sails of their talks.

The main thing causing RDA to fail is a bullying article published in 2007 in D-Lib Magazine and written by Diane Hillmann and Karen Coyle. They used the article to pressure the Joint Steering Committee to poison RDA, to make the code conform to their needs (as consultants), and to make it MARC unfriendly and library unfriendly. Had they not interfered so much, the code would likely have been successfully put in place by now.

Jay Jordan: He announced his retirement and will be departing with the multi-millions of dollars he got from OCLC in both current and deferred compensation. We expect to see a building on OCLC’s Dublin (Ohio) campus named after him soon: The Golden Parachute Building, and I have dibs on making the SACO proposal for it.

Dewpoint: The weather in New Orleans was Venusian: very hot and very humid. The people on Huron Street need a much better location scout. The poor climate makes it harder to benefit from the conference. ALA should stop using its conferences as profit engines for the organization and pay a little extra for more suitable locations.  
