My research in information visualization focuses on enabling large-scale text visualization and text analysis through the application of machine learning techniques and on making use of visualizations to support machine learning research. I take a human-centered approach to studying, designing, and evaluating model-driven visual analysis tools.
Contact: 650.450.0924, jcchNOSPAMuangNOSPAM@NOSPAMcsNOSPAM.NOSPAMstanfordNOSPAM.NOSPAMedu, Gates Building 382
Stanford Dissertation Browser
The Dissertation Browser enables the exploration of 9,000 Ph.D. dissertations by topical similarity and by year. We created the browser to help social scientists visualize university-wide research output and examine the impact of inter-disciplinary collaborations.
We describe our experiences building the Dissertation Browser in our CHI 2012 paper, and distill a set of design guidelines and design processes for model-driven visualizations. We sought analyst and expert feedback throughout the project, and iteratively modified both the visualization and the underlying topic model to address interpretation and trust issues that hinder analysis. Our iterative design process led to a novel topic similarity measure based on word borrowing.
Termite: Topic Model Visualization
Termite is a visual analysis tool designed for builders and users of statistical topic models. Our tool enables more rapid and accurate evaluation of topic model quality.
As detailed in our AVI 2012 paper, we incorporate a matrix view to support the assessment of topical term distributions and enable the comparison of latent topics. We devised a saliency measure to highlight distinctive vocabulary. We developed a seriation algorithm that re-orders words to reveal the clustering of related terms and promote the legibility of multi-word phrases.
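To make the idea of a saliency measure concrete, here is a minimal sketch of one plausible formulation: a term's overall probability weighted by how much observing the term shifts the topic distribution away from the topic prior (a KL divergence). The function name `term_saliency`, its inputs, and the exact formula are assumptions for illustration; the AVI 2012 paper should be consulted for the definition actually used in Termite.

```python
import math

def term_saliency(topic_term, topic_weights):
    """Illustrative saliency score per term (an assumed formulation).

    topic_term[t][w] is P(w | t); topic_weights[t] is P(t).
    Returns a list with one score per term index w.
    """
    n_topics = len(topic_term)
    n_terms = len(topic_term[0])
    scores = []
    for w in range(n_terms):
        # Marginal probability of the term: P(w) = sum_t P(w | t) P(t)
        p_w = sum(topic_term[t][w] * topic_weights[t] for t in range(n_topics))
        if p_w == 0:
            scores.append(0.0)
            continue
        # Distinctiveness: KL divergence between P(t | w) and P(t)
        distinctiveness = 0.0
        for t in range(n_topics):
            p_t_given_w = topic_term[t][w] * topic_weights[t] / p_w
            if p_t_given_w > 0:
                distinctiveness += p_t_given_w * math.log(p_t_given_w / topic_weights[t])
        scores.append(p_w * distinctiveness)
    return scores

# Hypothetical toy model: term 0 is shared by both topics,
# terms 1 and 2 each belong to a single topic.
toy_topic_term = [[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.5]]
toy_scores = term_saliency(toy_topic_term, [0.5, 0.5])
```

In this toy model the shared term scores zero because it says nothing about which topic is active, while the two topic-specific terms receive equal positive scores.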
We studied how people summarize text using descriptive phrases, and developed a novel algorithm for extracting keyphrases from documents.
In our TOCHI 2012 article, we describe our user study on human-generated keyphrases. We systematically examined linguistic features predictive of high-quality summary terms, and developed a model for the automatic extraction of descriptive phrases from text. We discuss issues of specificity and redundancy identified through user evaluations, and propose additional algorithms that enable adaptive selection of keyphrases. Finally, we demonstrate how our algorithms enable novel text visualization designs.
We examined how multiple experts organize research output from the InfoVis community. Our work quantifies the limitations of text analysis based on word statistics, but also points to potential directions for topic model research.
We developed a survey method for eliciting and aggregating topical organization from multiple experts. Our survey results quantify how much people define topics through shared words or documents, and enable the evaluation of topic models directly in terms of experts' organization of a domain. We constructed theoretically “optimal” word-based topic models from the collected data, and compared the performance of topic models created from abstracts vs. full-text documents.
Mapping Intellectual Changes in Academia
We created various visualizations to display the research output from 200 U.S. universities, based on an analysis of 1.05 million Ph.D. dissertations. The tools help social scientists analyze knowledge transfer among academic disciplines.
Using our tools, the machine learning researchers fine-tuned and verified the stability of the underlying topic model. The visualizations also facilitated communication among collaborating researchers from different disciplines so the final model reflects expert feedback from multiple disciplines. The visualizations and topic-based analyses led to various findings such as a growing split between molecular and ecological forms of biology, and changes driven by the rise of gender and ethnic studies.
History of Computational Linguistics
This visualization shows 45 years of history in computational linguistics based on 15,000 published papers and the flow of topics along 61,000 citations.
Our visualization enabled detailed examination of the lines of research (as predicted by the Topic Flow algorithm) by both the model builders and experts in the field. The tool revealed previously unknown issues in the algorithm, such as the unintended accumulation of flows due to cycles in the citation graph.
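Cycles of the kind mentioned above can also be checked for programmatically before any flow is propagated. The following is a generic depth-first search sketch, not part of the Topic Flow algorithm itself; the function name `find_cycle` and the dictionary-based graph representation are assumptions made for illustration.

```python
def find_cycle(citations):
    """Return one citation cycle as a list of paper ids, or None.

    citations maps a paper id to the list of paper ids it cites
    (a hypothetical representation chosen for this sketch).
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)

    state = {}  # paper id -> "active" (on the current path) or "done"
    for start in sorted(nodes):
        if start in state:
            continue
        path = [start]
        stack = [iter(citations.get(start, ()))]
        state[start] = "active"
        while stack:
            for nxt in stack[-1]:
                if state.get(nxt) == "active":
                    # Back edge: nxt is already on the current path
                    return path[path.index(nxt):] + [nxt]
                if nxt not in state:
                    state[nxt] = "active"
                    path.append(nxt)
                    stack.append(iter(citations.get(nxt, ())))
                    break
            else:
                # All outgoing citations explored; retire this paper
                state[path.pop()] = "done"
                stack.pop()
    return None

# Hypothetical toy graphs
cyclic = find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
acyclic = find_cycle({"A": ["B", "C"], "B": ["C"]})
```

Citation cycles can arise in practice when, for example, papers published around the same time cite one another, so a check like this is a cheap safeguard before running any algorithm that assumes an acyclic graph.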
Frog Gene Visualization
Sentiment Tree Visualization
This visualization displays word sentiments within a sentence. It enables our collaborating machine learning researchers to explore the output of their sentiment model and to compare model predictions with ground-truth data.
We present a probabilistic model for quantifying the effects of languages on color perception, based on an analysis of the World Color Survey (color naming data from 110 languages) and English color naming data collected on the web. In our CIC 2008 paper, we demonstrate that our model can identify well-named regions of the color space.
Semantic Text Zooming
Our text shortening algorithm can progressively shorten phrases (2 to 8 words in length) based on examples from Wikipedia. We demonstrate how our technique can enable adaptive resizing of text visualizations to fit small displays.