Tuesday, September 19, 2006

Librarian Has 130 Million Reasons To Get Good At Digitizing Content

Here's a good interview for you to look at!

One way people try to describe a large amount of data is to say it's a certain percentage of what's stored in the Library of Congress. Rightly or wrongly, that repository of American records has become a yardstick for measuring big.

So how much information does the Library of Congress hold? Truth be told, it's a moving target — the amount's always changing. A natural question that follows: How does the library handle all that data? IBD asked just that of Carl Fleischhauer, a 30-year Library of Congress veteran whose job involves figuring such things out.

Technically, he's program officer for the National Digital Information Infrastructure and Preservation Program. Practically, he's a data orchestrator.

IBD: What's at the Library of Congress? How much information does that amount to?

Fleischhauer: I think the picture people have in their minds is a huge number of books, and it's all the letters and words in the books. But . . . it's complicated.

The figures we give at our Web site are that we've got 29 million books. We've also got other printed materials, 2.7 million sound recordings, 12 million photographs, 4.8 million maps, 5 million music items — which in this case means scores or printed music — and 58 million manuscripts.

Manuscripts tend to be numerous because you count every letter page. So if you count George Washington's papers, you've got a lot of letters to him and from him.

That's how we come up with what we like to call 130 million items altogether.

IBD: How much is being put into electronic form?

Fleischhauer: We probably have 1% or 2% digitized. In the next decade, we'd really be tickled if we could digitize 10% of the collection. It's a big job.

We're certainly interested in what other people are digitizing.

Google is keen on doing a huge number of books. So is what's called the Open Content Alliance.

Speaking on behalf of the taxpayers, if there are private funds being invested (in digitizing books) we certainly don't want to duplicate that. That's why our digitizing focuses on our unique collections. We've got a lot of photos, manuscripts and very rare motion pictures from the 1890s.

IBD: What issues did you have to iron out before beginning to digitize?

Fleischhauer: The technology issue that is central with a lot of these originals in old formats is getting the stuff off the old thing. We're doing historically important books, so we don't want to take a knife and cut them out and run them through a sheet feeder. You need to turn pages and do text conversion.

It's fairly labor intensive. For old phonograph records you've got to play the darned thing. We've done some folk recordings on aluminum and acetate disks by Alan Lomax (who recorded much original American folk music). Each has to be set down on a turntable and a stylus selected and set just right. Some of the early motion pictures are on paper or very old film...
