Parse domain names with tldextract

Ulf Hamster 3 min.
python tldextract domain name hostname top level domain subdomain country

A domain name can consist of up to three parts: [subdomain].[hostname].[tld].

The top-level domain (TLD) refers to the authority where the domain name [hostname].[tld] has been registered. Usually the TLD refers to a country (e.g. .es for Spain, .co.uk for the United Kingdom), and can thus be used as a country indicator. Strictly speaking, .co.uk is a suffix below the TLD .uk, which is why tldextract calls this part the "suffix". Purpose-related TLDs are very common as well (e.g. .com for commercial, .gov for public entities). Taken together, the TLDs form a classification scheme of their own, one that we neither have to maintain nor invent ourselves.

Subdomains can be created by the owner of a specific [hostname].[tld]. A subdomain might refer to a department of a large organization. Off-the-shelf CMS platforms and social media sites may use subdomains for user accounts, e.g. [blogname].wordpress.com.

The hostname is usually a brand name or something similar. An individual might register the same name under all kinds of TLDs, i.e. janedoe.[tld]. Corporations do the same, e.g. yahoo.com, yahoo.co.uk, yahoo.jp, etc. If you have a feature that contains domain names, it makes sense to parse out just the hostname (without TLD and without subdomains) to identify the brand name.

Parse a string containing a domain

The package tldextract returns the subdomain, hostname (domain) and tld (suffix).

import tldextract
print(tldextract.extract('subhello.example.co.uk'))
ExtractResult(subdomain='subhello', domain='example', suffix='co.uk')

An empty string is returned for every part that is missing. For example, [hostname].[tld] returns subdomain='', and a bare [hostname] returns both subdomain='' and suffix=''.

print(tldextract.extract('justthehostname.com'))
print(tldextract.extract('justthehostname'))
ExtractResult(subdomain='', domain='justthehostname', suffix='com')
ExtractResult(subdomain='', domain='justthehostname', suffix='')

The package tldextract maintains a list of top-level domains, so it can distinguish them from hostnames. For example, "com" is identified as a TLD and not as a hostname, whereas a naive implementation might wrongly split "co.uk" into domain='co' and suffix='uk'.

print(tldextract.extract('com'))
print(tldextract.extract('co.uk'))
ExtractResult(subdomain='', domain='', suffix='com')
ExtractResult(subdomain='', domain='', suffix='co.uk')
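For contrast, here is a minimal sketch of such a naive implementation. The function `naive_split` is hypothetical and not part of tldextract; it simply assumes that the last label is always the TLD and the one before it is the hostname.

```python
def naive_split(name):
    """Hypothetical naive parser: treat the last label as the TLD."""
    labels = name.split('.')
    if len(labels) == 1:
        # a bare hostname without any dot
        return ('', labels[0], '')
    # everything before the last two labels is treated as the subdomain
    return ('.'.join(labels[:-2]), labels[-2], labels[-1])

print(naive_split('justthehostname.com'))     # ('', 'justthehostname', 'com')
print(naive_split('subhello.example.co.uk'))  # ('subhello.example', 'co', 'uk')
```

The first call looks fine, but the second one reports "co" as the hostname and loses the actual brand name "example" in the subdomain. Avoiding this requires a list of known suffixes, which is what tldextract provides.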

Cast Results

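Just a quick note: the result can be unpacked into three variables or cast to a list. This works as long as ExtractResult behaves like a tuple; tldextract 5.0 switched it from a namedtuple to a dataclass, in which case read the attributes .subdomain, .domain and .suffix instead.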

sub, dom, tld = tldextract.extract('subhello.example.co.uk')
sub, dom, tld
('subhello', 'example', 'co.uk')
aslist = list(tldextract.extract('subhello.example.co.uk'))
aslist
['subhello', 'example', 'co.uk']

Loop over a list of domain names

import tldextract
import pandas as pd

# given list of domain names
data = ['acme.com', 'acme.jp', 'nonsense.es', 'evil.acme.net', 'other.co.uk', '', None, 'acme.es']

# loop
parsed_list = [('', '', '') if pd.isnull(s) else tuple(tldextract.extract(s)) for s in data]

# show me
parsed_df = pd.DataFrame(parsed_list, columns=['sub', 'host', 'tld'])
parsed_df
    sub      host    tld
0            acme    com
1            acme     jp
2        nonsense     es
3  evil      acme    net
4           other  co.uk
5
6
7            acme     es
How is that useful?

The original domains are all unique.

from collections import Counter
Counter(data)
Counter({'acme.com': 1,
         'acme.jp': 1,
         'nonsense.es': 1,
         'evil.acme.net': 1,
         'other.co.uk': 1,
         '': 1,
         None: 1,
         'acme.es': 1})

However, the hostname "acme" is quite common.

Counter(parsed_df['host'])
Counter({'acme': 4, 'nonsense': 1, 'other': 1, '': 2})

The most common non-empty TLD is the Spanish one (.es).

Counter(parsed_df['tld'])
Counter({'com': 1, 'jp': 1, 'es': 2, 'net': 1, 'co.uk': 1, '': 2})

Conclusion

Decomposing the domain names results in two potentially useful features:

  1. The brand name (hostname)
  2. The country or purpose (tld)

Each of these two features takes fewer distinct values than the full domain names, which can be advantageous for encoding and vectorization.
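As a rough check on the example data, the number of distinct values can be counted directly. The sketch below hard-codes the parsed rows for the six non-empty domains from above, so it runs without tldextract; the variable names are illustrative.

```python
# parsed (sub, host, tld) rows of the six non-empty example domains
parsed = [
    ('', 'acme', 'com'),
    ('', 'acme', 'jp'),
    ('', 'nonsense', 'es'),
    ('evil', 'acme', 'net'),
    ('', 'other', 'co.uk'),
    ('', 'acme', 'es'),
]

# rebuild the full names by joining the non-empty parts
full_names = {'.'.join(part for part in row if part) for row in parsed}
hosts = {host for _, host, _ in parsed}
tlds = {tld for _, _, tld in parsed}

print(len(full_names), len(hosts), len(tlds))  # 6 3 5
```

With only six domains the reduction is modest, but the hostname feature already collapses four of them into the single value "acme".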