PhD Thesis


February 2013


In the real world, many fields have highly skewed class distributions and features that vary dramatically in their classification and runtime performance. With the huge volume of data on the web, such fields typically require machine learning (ML) techniques with low latency and high performance. Anti-phishing is one such field: it requires a very low False Positive Rate (FP), a reasonably high True Positive Rate (TP) and a fast response time. In many of these areas, including anti-phishing, however, almost all existing ML-based approaches have focused simply on designing features and building a monolithic model that uses them all at once. A fast response time is of paramount importance to the user experience in a live scenario, and naively extracting values for all features upfront is often overkill. In our previous work, we proposed a number of anti-phishing approaches that either extend existing URL blacklists in a probabilistic fashion or enhance feature-based anti-phishing methods with novel features. In this thesis, we build on that experience and propose a feature-type-aware cascaded learning framework for a variety of domains with skewed class distributions and features of varying classification and runtime performance, in an effort to achieve a good balance between the three desiderata of TP, FP and latency. By utilizing lightweight features in the early stages of the cascade and postponing prohibitively expensive ones to later stages, our approach achieves superior runtime performance in general, and can be further improved via parallelization in a distributed computing environment. Moreover, our approach scales as more features are added, and can be optimized in favor of FP or TP depending on the specific domain. 
In the context of anti-phishing, our cascaded approach achieves a 55.7% reduction in runtime on average over traditional single-stage models, with a low FP of 0.65% and a TP of 83.34%, and thus provides a fast and reliable solution for live detection scenarios.
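
The cascade idea described above can be illustrated with a minimal sketch. It is not the implementation developed in this thesis: the `Stage` type, the two toy scoring functions, and all threshold values are hypothetical and chosen only for illustration. Each stage computes a score from its own group of features and stops early when the score is confidently low or high; only ambiguous inputs pay for the more expensive later stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One cascade stage (hypothetical structure, for illustration only)."""
    name: str
    score: Callable[[str], float]  # extracts this stage's features, returns a phishing score in [0, 1]
    accept_below: float            # score <= accept_below: label "legitimate" and stop
    reject_above: float            # score >= reject_above: label "phishing" and stop


def classify(url: str, stages: List[Stage], final_threshold: float = 0.5) -> Tuple[str, int]:
    """Run the cascade; return (label, number of stages evaluated)."""
    score = 0.0
    for used, stage in enumerate(stages, start=1):
        score = stage.score(url)
        if score <= stage.accept_below:
            return "legitimate", used
        if score >= stage.reject_above:
            return "phishing", used
    # No stage was confident: fall back to a plain threshold on the last score.
    label = "phishing" if score >= final_threshold else "legitimate"
    return label, len(stages)


def cheap_url_score(url: str) -> float:
    """Toy lightweight scorer that looks only at the URL string (assumption)."""
    if "@" in url:
        return 0.95
    if url.startswith("https://") and len(url) < 30:
        return 0.05
    return 0.5


def expensive_content_score(url: str) -> float:
    """Toy stand-in for costly features such as page content (assumption)."""
    return 0.9 if "login" in url else 0.1


# Cheap features first, expensive features last.
STAGES = [
    Stage("url", cheap_url_score, accept_below=0.1, reject_above=0.9),
    Stage("content", expensive_content_score, accept_below=0.2, reject_above=0.8),
]

if __name__ == "__main__":
    for u in ["https://example.org/", "http://example.net/account/login-update"]:
        print(u, classify(u, STAGES))
```

In this sketch, widening the gap between `accept_below` and `reject_above` sends more inputs to later stages, and moving `reject_above` closer to 1 makes a stage label fewer inputs as phishing. That is one simple way a cascade could be tuned toward a lower FP or a higher TP, at the cost of runtime.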
