This link recently saved by hjl on March 19, 2011
Overview of Boilerpipe, maximum subsequence, and text-to-tag ratio clustering for text mining.
"In the world of web scraping, text mining and article reading utilities (readability bookmarklet) there is an ever growing demand for utilities that are capable of distinguishing parts of a HTML document which represent an article apart from other common website building blocks like menus, headers, footers, ads etc."
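As a rough illustration of the text-to-tag ratio idea mentioned above, the sketch below scores each line of an HTML document by the amount of visible text per tag. It assumes a simple regex is good enough to spot tags and that lines with no tags count as one tag; the function names are made up for this example, and the clustering step that would follow is not shown.

```python
import re

# Naive tag matcher; an assumption that is fine for a sketch, not for real HTML.
TAG_RE = re.compile(r"<[^>]*>")

def text_to_tag_ratio(line):
    """Characters of visible text divided by the number of tags on the line."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    # Treat a tag-free line as having one tag to avoid dividing by zero.
    return len(text) / max(len(tags), 1)

def line_ratios(html):
    """Ratio for every line; article text tends to score high, menus low."""
    return [text_to_tag_ratio(line) for line in html.splitlines()]

page = (
    '<div><a href="/">Home</a></div>\n'
    "<p>This is a long sentence of article text.</p>"
)
print(line_ratios(page))
```

Lines with high ratios are candidates for article content; a real extractor would smooth and cluster these values rather than use them raw.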
This link recently saved by hjl on October 22, 2010
An algorithm devised by Professor Gary Miller, Systems Scientist Ioannis Koutis and Ph.D. student Richard Peng, all of Carnegie Mellon's Computer Science Department, has enormous practical potential. Linear systems are widely used to model real-world systems, such as transportation, energy, telecommunications and manufacturing that often may include millions, if not billions, of equations and variables.
Solving these linear systems can be time consuming on even the fastest computers and is an enduring computational problem that mathematicians have sweated for 2,000 years. The Carnegie Mellon team's new algorithm employs powerful new tools from graph theory, randomized algorithms and linear algebra that make stunning increases in speed possible.
The algorithm, which applies to an important class of problems known as symmetric diagonally dominant (SDD) systems, is so efficient that it may soon be possible for a desktop workstation to solve systems with a billion variables in just a few seconds.
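To make the SDD class concrete, here is a minimal check, assuming the usual definition: the matrix is symmetric and each diagonal entry is at least the sum of the absolute values of the other entries in its row. The function name is hypothetical, and this only tests membership in the class; it is not the Carnegie Mellon solver.

```python
def is_sdd(a):
    """Return True if the square matrix `a` (list of lists) is symmetric
    and diagonally dominant under the definition assumed above."""
    n = len(a)
    if any(len(row) != n for row in a):
        return False  # not square
    for i in range(n):
        for j in range(i + 1, n):
            if a[i][j] != a[j][i]:
                return False  # not symmetric
    return all(
        a[i][i] >= sum(abs(a[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )

# The Laplacian of a three-node path graph is a standard SDD example.
laplacian = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
print(is_sdd(laplacian))
```

Graph Laplacians always satisfy this test, which is one reason graph theory tools apply to these systems.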
This link recently saved by hjl on March 30, 2010
Comments on data mining and approaches to graph analysis. "In this Google Tech Talk video, Running Large Graph Algorithms - Evaluation of Current State-of-the-Art and Lessons Learned, Andy Yoo, a Computer Scientist at Lawrence Livermore Labs, says the idea behind Data Science is to: fuse different forms of data from different sources into a graph, run graph mining algorithms, and extract out useful information. Data Science tries to understand unstructured data. Scientific data is usually structured, but web data is highly unstructured, consisting of text, web pages, sensor data, images, and so on."
This link recently saved by hjl on February 16, 2010
We present Aardvark, a social search engine. With Aardvark, users ask a question, either by instant message, email, web input, text message, or voice. Aardvark then routes the question to the person in the user’s extended social network most likely to be able to answer that question. As compared to a traditional web search engine, where the challenge lies in finding the right document to satisfy a user’s information need, the challenge in a social search engine like Aardvark lies in finding the right person to satisfy a user’s information need. Further, while trust in a traditional search engine is based on authority, in a social search engine like Aardvark, trust is based on intimacy. We describe how these considerations inform the architecture, algorithms, and user interface of Aardvark, and how they are reflected in the behavior of Aardvark users.
This link recently saved by hjl on November 08, 2009
Sample of some social network analysis using R. "The figure below is the 2-core of the full network. As you might imagine, the full network is quite large, but the vast majority of nodes are pendants; therefore, visualizing the 2-core is much more useful. Those that twittered with #rstats are in red and labeled, while all of the intervening nodes are in blue."
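The linked analysis uses R; the sketch below shows the same 2-core idea in plain Python, assuming an undirected graph stored as a dict that maps each node to the set of its neighbours. The `k_core` name and the sample graph are invented for illustration. The k-core is found by repeatedly deleting nodes with fewer than k neighbours, which is how pendants drop out.

```python
def k_core(adj, k):
    """Return the k-core of an undirected graph as a pruned adjacency dict."""
    # Copy so the caller's graph is left untouched.
    adj = {u: set(vs) for u, vs in adj.items()}
    pending = [u for u, vs in adj.items() if len(vs) < k]
    while pending:
        u = pending.pop()
        if u not in adj:
            continue  # already removed
        for v in adj.pop(u):
            adj[v].discard(u)
            if len(adj[v]) < k:
                pending.append(v)
    return adj

# A triangle (a, b, c) with a two-node tail (d, e) hanging off a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
print(sorted(k_core(graph, 2)))
```

Removing e leaves d as a new pendant, so the tail peels away entirely and only the triangle remains.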
This link recently saved by hjl on October 16, 2009
SimMetrics - An open source collection of string distance metrics. "In my investigations into string metrics, similarity metrics and the like I have developed an open source library of Similarity metrics called SimMetrics. SimMetrics is an open source Java library of Similarity or Distance Metrics, e.g. Levenshtein distance, that provide float based similarity measures between String Data. All metrics return consistent measures rather than unbounded similarity scores. This open source library is hosted at http://sourceforge.net/projects/simmetrics/"
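The difference between an unbounded distance and a bounded similarity score can be sketched as follows. This is not the SimMetrics API (which is Java); the function names and the normalisation by the longer string's length are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Map the distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"), similarity("kitten", "sitting"))
```

Because the distance can never exceed the length of the longer string, the score stays within 0 and 1 regardless of input size.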
This link recently saved by hjl on September 27, 2009
From NIST Software and Systems Division, Information Technology Laboratory. "This is a dictionary of algorithms, algorithmic techniques, data structures, archetypal problems, and related definitions. Algorithms include common functions, such as Ackermann's function. Problems include traveling salesman and Byzantine generals. Some entries have links to implementations and more information. Index pages list entries by area and by type. The two-level index has a total download 1/20 as big as this page."
This link recently saved by hjl on July 12, 2009
I want one of these! "We have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel). Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel's applicability is harder to quantify, but so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code."
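The iteration model described in the quote can be imitated on one machine. The sketch below is not Pregel's API, which is not shown in the excerpt; it is a single-process imitation in which every name is hypothetical. It assumes each vertex has at least one outgoing edge and that every edge target is itself a key of the graph dict.

```python
def pagerank_supersteps(out_edges, supersteps=30, damping=0.85):
    """PageRank written as vertex-centric supersteps with message passing."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    inbox = {v: [] for v in out_edges}
    for step in range(supersteps):
        # Each vertex reads only the messages sent in the previous superstep.
        if step > 0:
            for v in out_edges:
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
        # Each vertex sends an equal share of its rank along its out-edges.
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                outbox[t].append(rank[v] / len(targets))
        # Swapping the mailboxes plays the role of the synchronisation barrier.
        inbox = outbox
    return rank

web = {"a": ["b"], "b": ["a"], "c": ["a"]}
print(pagerank_supersteps(web))
```

No vertex touches another vertex's state directly; all communication goes through messages delivered at the next superstep, which is what makes the model easy to distribute.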
This link recently saved by hjl on June 27, 2009
Comments on a SIGIR 2008 paper from Yahoo Research, "ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines", regarding performance effects of caching most common queries and the interaction with subsequent statistics used to generate pruned indices.
This link recently saved by hjl on June 27, 2009
For sponsored search, ads are associated with bids. When a user issues a search query, bids are typically matched to the query using broad-match semantics: all the terms in the bid need to be in the query (but not vice versa). This means that the roles of the query and the bid/document are reversed in sponsored search, in turn making standard retrieval techniques based on inverted indexes ill-suited for sponsored search. This paper proposes novel index structures and query processing algorithms for sponsored search. We evaluate these structures using a real corpus of 180 million advertisements.
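Broad-match semantics can be sketched with a simple counting scheme: index each bid by its terms, count how many of a bid's terms appear in the query, and keep the bids whose count equals their term count. This is only a baseline for illustration, not the index structure the paper proposes; the function names and sample bids are invented.

```python
from collections import defaultdict

def build_index(bids):
    """Map each term to the bids containing it, and record each bid's size."""
    index = defaultdict(list)
    sizes = {}
    for bid_id, phrase in bids.items():
        terms = set(phrase.lower().split())
        sizes[bid_id] = len(terms)
        for term in terms:
            index[term].append(bid_id)
    return index, sizes

def broad_match(query, index, sizes):
    """Return bids whose terms ALL occur in the query (not vice versa)."""
    hits = defaultdict(int)
    for term in set(query.lower().split()):
        for bid_id in index.get(term, ()):
            hits[bid_id] += 1
    return sorted(b for b, count in hits.items() if count == sizes[b])

bids = {"b1": "running shoes", "b2": "cheap running shoes", "b3": "shoes"}
index, sizes = build_index(bids)
print(broad_match("red running shoes", index, sizes))
```

The query may contain extra terms such as "red", but a bid such as "cheap running shoes" is rejected because one of its terms is missing from the query, which is the role reversal the abstract describes.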