In the Trustworthiness group (TW), we work on a diverse analysis of reliability, credibility and explainability on the Web utilizing different types of data ranging from tweets, fake news, phishing and so on. We conduct a wide range of research themes including fake news detection, reliable website detection and its explainability, phishing detection, article credibility, sentiment analysis, and anomaly detection of IoT devices.
Phishing URL Detection
Phishing is a type of personal information theft in which phishers lure users to steal sensitive information. Phishing detection mechanisms using various techniques have been developed. Our hypothesis is that phishers create fake websites with as little information as possible in a webpage, which makes it difficult for content- and visual similarity-based detections by analyzing the webpage content. To overcome this, we focus on the use of Uniform Resource Locators (URLs) to detect phishing.
Segmentation-based Phishing URL Detection
Information extracted from URLs might indicate significant and meaningful patterns essential for phishing detection. To enhance the accuracy of URL-based phishing detection, we need an accurate word segmentation technique to split URLs correctly. However, in contrast to traditional word segmentation techniques used in natural language processing (NLP), URL segmentation requires meticulous attention, as tokenization, the process of turning meaningless data into meaningful data, is not as easy to apply as in NLP. In our work, we concentrate on URL segmentation to propose a novel tokenization method, named URL-Tokenizer, by combining the Bert tokenizer.
Phishing URL Detection using Information-rich Domain and Path Features
We define the features, extracted from raw URLs directly, as Information-rich features, such as words/characters, which we transform to integer-encoded vector representation. Simply, we extract words or characters from the URL text and consider them as features themselves. Such features are neither risky to detection rate like manually generated features. We define such features as information-rich features as they contain useful information (e.g., alphanumeric characters and meaningful words). As we aim to overcome the bottleneck of manually generated features – i.e, fixed features where phishers can bypass by a small change of URL structures and it requires not only the knowledge of featuring engineering experts but also enough durability to make sure phishers cannot easily deceive. Thus, we target information-rich features by extracting meaningful words from raw URLs.
URL-based Phishing Detection using the Entropy of Non-Alphanumeric Characters
Non-alphanumeric (NAN) characters are useful for phishing detection because phishers tend to create fake URLs with NAN characters such as:
1. extra unnecessary dots
2. “//” to redirect the user to a completely different domain
3. “-” in the domain to mimic a similar website name
4. unnecessary symbols
Previous studies have also extracted the frequencies of specific special characters such as “-”, “//”, “_”, and “.” in each URL. However, instead of directly using the frequencies of NAN characters found in URLs, we propose a new feature called the entropy of NAN characters for URL-based phishing detection to measure the distributions of these special characters between phishing and legitimate websites. Our objective is to develop a new feature that is useful in URL-based phishing detection whenever little or no information is available in a webpage of the phishing website.
Systematic Investigation of Social Context Features for Fake News Detection
In our current world, world-wide news has never been as easily accessible and almost every person uses social media to communicate with each other. This has opened the doors for malicious people to spread fake news. It is important to detect misinformation on the web, as it can harm people in various ways. Instead of only looking at the news content to decide if it’s trustworthy or not, social context features can also be used to detect fake news. We use features such as sentiment analysis, TF-IDF of the tweet, psycholinguistic features (Empath) of the tweets, user follower count and user statuses count. Using these features, we train a machine learning classifier and determine the tweet’s trustworthiness.
Unreliable Website Detection
As more people start to rely on the internet for their daily news and information, the need to detect unreliable websites has risen. While previous work focuses on tackling this problem by looking at linguistic and social features, we propose a new set of features that focuses on the performance and usability of the webpages in question. Google Lighthouse (https://developers.google.com/web/tools/lighthouse?hl=es) is an open-source, automated tool for improving the quality of the web pages developed by Google. We can get metrics and scores that measure a webpage’s performance and usability from 5 different perspectives: Performance, Accessibility, Best Practice, Search Engine Optimization (SEO), and Progressive Web App (PWA).
Credibility Check in Articles
In recent years, news containing false information called fake news has become a problem, and it has been revealed that it spreads more widely and rapidly than factual news. Therefore, an automatic credibility checking system that can be executed in a short period of time is required. Considering the number of authors involved in writing and editing the article can be reflected in the credibility and quality of the article’s content, it can be a useful metric to use. In order to apply this evaluation metric for credibility measurement, it is necessary to estimate the number of authors from sentences. We propose a method to detect changes in the writing style of sentences based on the frequency of occurrence of part-of-speech n-gram and estimate the number of authors to check credibility of the article.