
12 February 2011

Similarity in research

Pedro Correia reports: research carried out at Stanford University shows the similarities in the language used in doctoral dissertations across the university's departments. No accounting department appears. Even so, I believe it is included within the business school. In any case, both the animation and the study itself are interesting. The link follows.

Here is the summary:

The Stanford Dissertation Browser is an experimental interface for document collections that enables richer interaction than search. Stanford's PhD dissertation abstracts from 1993-2008 are presented through the lens of a text model that distills high-level similarity and word usage patterns in the data. You'll see each Stanford department as a circle, colored by school and sized by the number of PhD students graduating from that department.
When you click a department, it becomes the focus of the browser and every other department moves to show its relative similarity to the centered department. The similarity scores are computed using a supervised mixture model based on Labeled LDA: every dissertation is taken as a weighted mixture of a unigram language model associated with every Stanford department. This lets us infer that, say, dissertation X is 60% computer science, 20% physics, and so on. These scores are averaged within a department to compute department-level statistics (the similarities shown), and need not be symmetric. For instance, Economics dissertations at Stanford use more words from Political Science than vice versa. Essentially, the visualization shows word overlap between departments measured by letting the dissertations in one department borrow words from another department. Which departments borrow the most words from which others? The statistics are computed for each year in the data.
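To make the averaging step in the summary concrete, here is a minimal sketch in Python. It does not fit Labeled LDA; it assumes the per-dissertation mixture weights have already been inferred, and only shows how averaging them within a home department gives scores that need not be symmetric. The function name `department_similarity` and the toy weights are hypothetical, invented for illustration, and are not taken from the Stanford data.

```python
from collections import defaultdict


def department_similarity(dissertations):
    """Average per-dissertation mixture weights within each home department.

    `dissertations` is a list of (home_department, mixture) pairs, where
    `mixture` maps a department name to that dissertation's weight on it.
    The weights are assumed to be precomputed by some topic model.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for home, mixture in dissertations:
        counts[home] += 1
        for dept, weight in mixture.items():
            totals[home][dept] += weight
    # Divide each accumulated weight by the number of dissertations
    # in the home department to get the department-level average.
    return {
        home: {dept: total / counts[home] for dept, total in row.items()}
        for home, row in totals.items()
    }


# Toy, made-up weights (not real Stanford figures).
toy = [
    ("Economics", {"Economics": 0.7, "Political Science": 0.3}),
    ("Economics", {"Economics": 0.5, "Political Science": 0.5}),
    ("Political Science", {"Political Science": 0.9, "Economics": 0.1}),
]
scores = department_similarity(toy)
print(scores["Economics"]["Political Science"])
print(scores["Political Science"]["Economics"])
```

In this toy data the Economics row puts more average weight on Political Science than the Political Science row puts on Economics, which is the kind of asymmetry the summary describes.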
