Saturday, July 23, 2011

Microsoft Research Summit 2011 - day 3

Dinner Cruise
The dinner cruise turned out not to be especially cold, since the ship had large indoor areas where we ate. Microsoft also provided an open bar at this and indeed at all the dinners. The wine at such functions is often dubious, but here even the wine was good, and the selection of micro-brew beers was equally impressive.

Of course the goal of the dinner was not food or drink or even the scenery along the lake, but the conversation among colleagues. I ate with fellow deans from Illinois, Michigan, and Carnegie Mellon, and even if no great research comes from our discussions, collegial discourse is an important social component in the efficient functioning of organizations and projects.

Day 3
I should remember more of day 3 than I do, but jet lag had not quite lost its hold, and the morning presentations, while good, left little permanent impression. The main event of day 3 was in any case the iSchool meeting with Lee Dirks and Alex Wade from Microsoft Research. Lee and Alex gave some sense of the projects they are working on. Lee especially has an interest in long-term digital archiving that includes involvement in projects like PLANETS. While testing is an official component of PLANETS, I find that it puts less emphasis on testing than on planning and organization. Testing is, however, what is really needed, and that is what I tried to suggest in the meeting -- not, I think, with great success.

The other research aspect that I tried to sell, without much obvious resonance, was ethnographic research on what digital tools people really use and what they really want. Microsoft builds tools and we saw a lot of them that are oriented toward research, but I wonder how well some of them will do in the academic marketplace in the long run. Ethnographic research gives deeper insights into what people understand and misunderstand than do surveys.

Just before I left we held the oral thesis defense of one of my very best MA students [1], whose thesis looked at how a group of literature professors at Humboldt-Universität zu Berlin (which actively supports Open Access) regard Open Access. It was striking how much they misunderstood Open Access and how little they knew about it. Most of them would never have filled out a survey, and this information would just have slipped away or remained as an anomaly. Microsoft could profit from research like this and had an interest in it in years past. It is less clear that it does today.

[1] Name available on request, with the student's permission.

Tuesday, July 19, 2011

Microsoft Research Summit 2011 - day 2

Cosmos: big data and big challenges.
Pat Helland talked about massively parallel processing based on Dryad using a petabyte-scale store, and made the point that these massive systems process information differently. Database processing in this environment involves tools like SCOPE, an SQL-like language that has been optimized for execution over Dryad. Saving this data over the long term is a problem because of bit rot. Cosmos keeps at least three copies of the data, checks them regularly, and replaces data that is damaged. It is interesting how close this is to LOCKSS (which keeps seven copies).
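
To make the scrubbing idea concrete, here is a minimal sketch of majority-vote replica auditing in Python -- my own illustration, not anything Cosmos (or LOCKSS) actually runs, and the replica directories and block name are invented:

```python
import hashlib
from pathlib import Path

# Hypothetical replica directories standing in for independent copies of the data.
REPLICAS = [Path("replica_a"), Path("replica_b"), Path("replica_c")]

def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(block_name: str) -> None:
    """Compare the copies of one block and repair any copy that disagrees
    with the majority -- a toy stand-in for scrubbing against bit rot."""
    digests = {r: sha256(r / block_name) for r in REPLICAS}
    # The digest held by the most replicas is treated as the good one.
    good = max(set(digests.values()), key=list(digests.values()).count)
    source = next(r for r, d in digests.items() if d == good)
    for replica, digest in digests.items():
        if digest != good:
            (replica / block_name).write_bytes((source / block_name).read_bytes())
            print(f"repaired {replica / block_name} from {source / block_name}")

# audit("block_0001.dat")  # assumes the replica directories and block exist
```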

In talking about how faster processing is not always the solution to processing problems, a speaker quoted Henry Ford as saying: “If I had asked my customers what they wanted, they would have said faster horses...” (source: Eric Haseltine, Long Fuse and Big Bang).

NUI (Natural User Interfaces)
One of the speakers distinguished between making user interfaces imitate nature and making them feel natural. Non-verbal cues that convey meaning need to be part of the interaction. Another speaker said that we need better feedback systems. An example is a touch screen with a single button: if a user touches it and nothing happens, the user will hit it again and again. When designers changed the button to send out sparkles when touched, the repeated touching stopped, even though sparkles are not a natural result of touching a screen.

In the discussion someone said that we aren't doing science if we can't go back to the data. This suggests a clear separation of data and processing that speakers about very large data sets said is no longer really possible, since the data is usable only after a degree of processing. The person who made the comment, however, was doing fundamentally social science research, whose data are far more human-readable than Big Science data.

Other comments of interest:

  • Do we need to take the “good” into account in our interpretation of the “natural”? 
  • We are no longer interfacing to a machine, though the speaker was not sure what exactly we are interfacing to. There is no machine, only a task.

Microsoft clearly has a strong interest in image management, particularly three-dimensional images such as are used in medical imaging (doubtless a good market) or gaming. They are dividing a picture into quadrants and creating mathematical representations of the edges in each square to create a hash for finding similar photos. Photosynth.net was also demoed -- it allows the creation of three-dimensional images from multiple photos.
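
The details were not spelled out, but a toy analogue of the idea might look like the sketch below, which uses Pillow's edge filter and a coarse per-quadrant summary. This is my own illustration, not Microsoft's algorithm, and the file names are invented:

```python
from PIL import Image, ImageFilter  # Pillow

def edge_quadrant_hash(path: str, grid: int = 2) -> str:
    """Detect edges, split the image into grid x grid squares, and reduce
    the edge intensity of each square to one hex digit, giving a short
    hash that visually similar pictures should (roughly) share."""
    img = Image.open(path).convert("L").resize((256, 256))
    edges = img.filter(ImageFilter.FIND_EDGES)
    cell = 256 // grid
    digits = []
    for row in range(grid):
        for col in range(grid):
            box = (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)
            mean = sum(edges.crop(box).getdata()) / (cell * cell)
            digits.append(format(int(mean) // 16, "x"))  # 0..255 -> one hex digit
    return "".join(digits)

# Two photos of the same scene should produce similar hashes:
# print(edge_quadrant_hash("photo_a.jpg"), edge_quadrant_hash("photo_b.jpg"))
```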

Evening Cruise
Microsoft has planned a dinner cruise for the evening. It should be pleasant (I will comment tomorrow), but many of us wanted to go back to the hotel to leave computers, etc., and to change clothes because it is fairly chilly out (despite the heat wave in the rest of the US). 

Monday, July 18, 2011

Microsoft Research Summit 2011 - day 1

Microsoft Research invited me to the Summit as part of the iSchool deans group. This blog posting (and several that follow) has my notes and comments.

Plenary
Tony Hey opened the Summit with a talk about changes in scholarly communication, including ways of evaluating output. One of the reasons he left academia had to do with the ranking process at British universities. He emphasized that Microsoft is open (as Steve Jobs recently admitted). Microsoft is now working with the OuterCurve Foundation “to enable the exchange of code and understanding among software companies and open source communities”.

One of the major goals of the conference – a goal that speakers emphasized repeatedly – is to network. This leads to a type of intellectual market in which researchers try to sell their ideas to others who are also trying to sell ideas. Theoretically everyone is a potential idea buyer too, but realistically most people want to sell to Microsoft to get research money. This makes Microsoft staff very popular.

As is typical of conferences of computer-oriented people, the wireless network is periodically unable to keep up with the demand. Part of the problem comes from people viewing data-intensive websites related to the presentations (I tried too), but it still seems like a problem that a corporation like Microsoft should be able to overcome: too many people using too few access points. Happily the problem went away once the plenary session ended.

Breakout Session: Federal Worlds meet Future Worlds
Howard Schrove from DARPA talked about two models of survivability: the “Fortress” model (which is rigid and doesn't work against an enemy who is already inside) and the “Organism” model (which is adaptable). There is a balance in biology between fixed systems that address known threats and adaptable systems that address new dangers. The underlying causes of problems in computers come from a few known sources, especially the blurring of the difference between data and (executable) code. The speaker said that hardware immunity is relatively cheap to develop. Self-adaptive defensive architecture is an adaptive method for software that checks for behaviors that would compromise it and implements on-the-fly fixes. Instruction sets can be encrypted and randomized. Networking is a vulnerability amplifier, but if the cloud has an operating system that functions essentially as a public health system for the cloud, it may be possible to move the solutions out faster than the attack progresses. A quorum computation can check whether certain systems have been compromised. The response could involve reduced performance and randomization to confuse the attacker. Biosocial concepts are the underpinning of resilient clouds.
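
The quorum idea at least is simple enough to sketch. The toy Python example below (my own illustration with invented node names, not DARPA's or Microsoft's code) accepts the majority answer from replicated computations and flags the dissenters as possibly compromised:

```python
from collections import Counter

def quorum_check(results: dict[str, bytes]) -> tuple[bytes, list[str]]:
    """Accept the answer returned by the majority of nodes and flag any
    node whose answer differs as possibly compromised."""
    majority, _ = Counter(results.values()).most_common(1)[0]
    suspects = [node for node, value in results.items() if value != majority]
    return majority, suspects

# Hypothetical results from five cloud nodes running the same job:
answers = {"node1": b"42", "node2": b"42", "node3": b"41",
           "node4": b"42", "node5": b"42"}
accepted, flagged = quorum_check(answers)
print(accepted, flagged)  # b'42' ['node3']
```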

Breakout Session: Reinventing Education
Kurt Squire presented some of the educational gaming development that he is working on to give people richer experiences with topics like the environment (in particular, consequences for a county in Wisconsin) and medicine (in particular, identifying breast cancer). Seth Cooper presented a game called FoldIt, where the goal is to fold proteins. Problem-solving is fun, and that is part of what makes the games interesting. Tracy Fullerton, a professional game designer, spoke about why traditional assessment takes the fun out of game design. She explained the “yes, and...” game (where you must preface each statement with “yes, and...” rather than “but” or “no”), which helps build collaboration.

Closing plenary
The closing plenary included a variety of speakers. One talked about how Microsoft has been trying to enhance the security of its code. Another spoke about a new app that allows people to write programs on their mobile phones. A developer spoke about echo cancellation for enhancing speech recognition (primarily for gaming). Conclusion: computing research has incredible diversity and is rarely exclusively "basic" or "applied".

Monday, July 11, 2011

Computational Thinking

I first heard this concept during David De Roure's talk at the Bloomsbury Conference (see the blog entry for 3 July 2011) and want to take this opportunity to define computational thinking for the sake of my students and to apply it to digital archiving (and related projects).

Definition
“Computational thinking” is (at least in my definition) processing information the way a computer processes it with the existing tools and systems. These tools and systems change over time, and computational thinking has to shift over time as well. At present it implies some understanding of, for example, how to use regular expressions to identify specific text strings and how to search indexed information to find matches. Computational thinking tends to be literal (this string with this specification at this time) and tends to be unforgiving (there is no accidental recognition of what the author really meant). Computational thinking is what students ideally learn in their first computer programming class. The “born digital” generation has no advantage here and perhaps even a disadvantage, since they did not have to think computationally when they first interacted with computers.
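
A trivial example of that literalness, the kind of thing I would show students, using a regular expression in Python:

```python
import re

# A literal, unforgiving rule: match the exact token "checksum" and
# nothing the author might have "really meant".
pattern = re.compile(r"\bchecksum\b")

print(pattern.findall("The checksum failed."))   # ['checksum']
print(pattern.findall("The check-sum failed."))  # [] -- the hyphen defeats the match
print(pattern.findall("The Checksum failed."))   # [] -- case matters without re.IGNORECASE
```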

Computational Thinking and Archiving
In class I talked with students about the differences between the implicit definitions of integrity and authenticity in the analog and digital worlds. One of my favorite examples is a marginal note in a book. While we as librarians tend to discourage readers from marking in library books, we would not throw out a book as irreparably damaged because of a marginal note. A marginal note by a famous author can even add value (the Cornell CLASS project is an example). In the digital archiving world, however, we judge integrity by check-sums and hash-values. We do not look at the content, but at whether two or more check-sums agree with each other. Since a marginal comment changes the check-sums of (for example) a PDF file, we would replace that copy in a LOCKSS archive.

If readers wanted to add a marginal comment to a file without changing its integrity (that is, its check-sum), then they could add the comment external to the file with a mechanism (search or index) to locate where it belongs in the original file. This is not necessarily trivial, but is certainly doable as long as content is not regarded as a single file, but as a set of interacting resources. Merely thinking about this choice is computational thinking.
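
A minimal sketch of such an external annotation mechanism, assuming a JSON sidecar file and a quoted passage as the anchor (the file names and record structure are invented for illustration):

```python
import hashlib
import json
from pathlib import Path

def file_checksum(path: Path) -> str:
    """Fixity value of the archived document; it must never change."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def add_marginal_note(document: Path, anchor_text: str, note: str, sidecar: Path) -> None:
    """Store a reader's comment in a sidecar file, anchored to a quoted
    passage, so the archived document itself is never modified."""
    entry = {
        "document": document.name,
        "document_sha256": file_checksum(document),  # recorded at annotation time
        "anchor": anchor_text,                        # used to relocate the note later
        "note": note,
    }
    notes = json.loads(sidecar.read_text()) if sidecar.exists() else []
    notes.append(entry)
    sidecar.write_text(json.dumps(notes, indent=2))

# Hypothetical usage: annotate article.pdf without touching its bytes.
# add_marginal_note(Path("article.pdf"), "the disputed passage",
#                   "Compare chapter 3.", Path("article.notes.json"))
```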

Computational Thinking and digital cultural migration
Computational thinking is needed in order to recognize content that may not be readily comprehensible in future eras -- words or phrases that will likely be obscure to future (human) readers, and for which the machine needs specific rules to follow. For example, the city of New York is likely to remain familiar in 100 years, but Saigon (now Ho Chi Minh City) may well be hard for ordinary readers to recognize, unless the name changes back, in which case readers may need help with Ho Chi Minh City instead.
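
A toy example of such machine-followable rules, with an invented two-entry rule table; a real migration tool would obviously need far more rules and far more sensitivity to context:

```python
# Each rule pairs a term likely to puzzle future readers with the gloss to attach.
MIGRATION_RULES = {
    "Saigon": "former name of Ho Chi Minh City (renamed 1976)",
    "Bombay": "former name of Mumbai (renamed 1995)",
}

def annotate_for_future_readers(text: str) -> str:
    """Attach explanatory glosses in square brackets; the original wording
    stays in place so the annotation can be reversed or revised later."""
    for term, gloss in MIGRATION_RULES.items():
        text = text.replace(term, f"{term} [{gloss}]")
    return text

print(annotate_for_future_readers("She flew from Saigon to Bombay in 1968."))
```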

Of course computational thinking may change substantially when natural speech recognition improves to the point that computer-based comprehension is not fundamentally worse than human comprehension. Or it may require a different type of computational thinking. Any assumptions will likely err in some direction.

Sunday, July 3, 2011

ICE Forum and Bloomsbury Conference

This post reports on two interrelated and back-to-back meetings: the International Curation Education (ICE) Forum (sponsored by JISC) and the Bloomsbury Conference (sponsored by University College London or UCL). Both took place in the Roberts Building at UCL (which is, interestingly enough, next to where I often lived in London in the 1970s at the now-vanished Friends International Centre).

Overlap among the attendees was only partial – I would estimate that about a third of the registered attendees went to both. The University of North Carolina and Pratt had particularly strong representation, the former because of research projects, the latter because of a summer school for students. This post will not discuss all of the presentations, only a few points that seemed important to me.

ICE Forum
My own talk at the beginning of the ICE Forum addressed the question of whether the world needs digital curators. My answer talked about the need for digital cultural migration to make content comprehensible over long periods of time. The first half explained what this meant and the second looked at how we can design software to help migrate content. When I have talked about this to library groups, the audience largely sees the need as obvious. Many archivists in this audience felt outraged. One argued that archivists ought to leave it to future generations to interpret content. Another listener felt that machine-based interpretation and migration was too mechanical and allowed too little scope for human sense-making – though she grew thoughtful when I suggested that writing code to interpret a file was not fundamentally different than other forms of writing about it. I will say more about digital cultural migration in a future post.

Seamus Ross (Toronto) gave the closing talk at the ICE Forum, in which he quoted Doron Swade to the effect that “software is a cultural artifact”. His argument followed my own theme closely in saying that information needs to be annotated and reannotated to be useful for the future. He emphasized the need for case studies like those in law or business school, and he recommended accrediting not the schools but the graduates. We talked about whether effective accreditation was possible without legal requirements and agreed that a legal mandate would help. Some in the audience disliked the idea of individual accreditation as creating an elite. This did not bother either Seamus or me.

Bloomsbury Conference
Carol Tenopir (Tennessee) discussed a research project to test the hypothesis that scholars who use social media read less. (It turns out that they do not.) Some of her statistics were especially interesting. Among scholars:

Electronic sources
  • In 2011 88% of scholarly reading in the UK came from an electronic source (94% of those readings from a library).
  • In 2005 54% of the scholarly reading in the US was from an electronic source.
Screen reading
  • In 2011 45% of the scholarly reading was done on the computer screen and 55% of scholars printed a copy.
  • In 2005 19% of the scholarly reading was done on the computer screen.
While the studies were done in different locations (US & UK) at different times, the expectation is that the country makes no significant difference. A substantial decline in personal subscriptions combined with a substantial improvement in the quality of computer screens could be significant factors. Carol's article is online in PLoS ONE.

David De Roure (Oxford eResearch Centre) talked about Tony Hey's book on the “Fourth Paradigm”. Data-centric research is talked about as if it is new, but (David pointed out) the arts and humanities have done it for a long time. One of the challenges is to get people to think computationally. People also need to stop thinking in terms of “semantically enhanced publication” and to shift their thinking toward “shared digital research objects.” As an alternative to thinking in “paper-sized chunks”, Elsevier now offers an “executable paper grand challenge”. Perhaps Library Hi Tech should too.

Carolyn Hank (McGill) gave another notable talk. Her dissertation research was on scholars who blog and on the blogs themselves. She did purposeful sampling, drawing from the academic blog portal. Of 644 blogs, 188 fit her criteria and 153 completed the sample. 80% of the authors considered their blog to be a part of the scholarly record. 68% also said that their blog was subject to critical review. 76% believed that their blog led to invitations to present at a conference. 80% would like to have their blogs preserved for access and use for the “indefinite future”.

The last presentation that I heard was by Claire Ross, a doctoral student in digital humanities at UCL. While talking about the effect of social media, she told how she tweeted about her interests when she arrived at UCL and almost immediately got a response from a person at the British Museum that led to a research project. She uses her blog to show her research activities and argued that Twitter enables a more participatory conference culture. I confess that blogging about conferences makes me listen more closely. Perhaps I should try twittering too. Among her (many) interests is the internet-of-things (especially museum objects), which fits well with the Excellence Cluster (Bild Wissen Gestaltung) that we are developing at Humboldt-Universität zu Berlin.