morpheus.parsers.url_parser

Functions

parse(urls[, req_cols])

Extract hostname, domain, subdomain and suffix from URLs.

parse(urls, req_cols=None)

Extract hostname, domain, subdomain and suffix from URLs.

Parameters:
urls : SeriesType

URLs to be parsed.

req_cols : typing.Set[str], optional

Columns to extract. Must be a subset of hostname, domain, subdomain and suffix. If None (the default), all four columns are returned.

Returns:
DataFrameType

DataFrame containing the selected columns for each input URL.

Examples

>>> from cudf import DataFrame
>>> from morpheus.parsers import url_parser
>>>
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> url_parser.parse(input_df["url"])
            hostname  domain suffix subdomain
0     www.google.com  google    com       www
1          gmail.com   gmail    com
2         github.com  github    com
3  pandas.pydata.org  pydata    org    pandas
>>> url_parser.parse(input_df["url"], req_cols={'domain', 'suffix'})
   domain suffix
0  google    com
1   gmail    com
2  github    com
3  pydata    org
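
To make the relationship between the four columns concrete, the following is a minimal pure-Python sketch of the same decomposition for a single URL. It is not the Morpheus implementation: the helper name `split_url` and the tiny `KNOWN_SUFFIXES` set are hypothetical and exist only for illustration, whereas a real parser would rely on the full Public Suffix List and operate on a whole series at once.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny suffix set for illustration only;
# a real parser would consult the full Public Suffix List.
KNOWN_SUFFIXES = {"com", "org", "co.uk"}


def split_url(url):
    """Return hostname, domain, suffix and subdomain for a single URL."""
    # urlsplit only populates the host when a "//" separator is present,
    # so prefix scheme-less inputs such as "gmail.com" with "//".
    if "://" not in url:
        url = "//" + url
    hostname = urlsplit(url).hostname or ""
    labels = hostname.split(".")

    # Look for the longest known suffix, always leaving at least
    # one label available for the domain.
    suffix_len = 0
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[-n:]) in KNOWN_SUFFIXES:
            suffix_len = n
            break

    cut = len(labels) - suffix_len  # index of the first suffix label
    return {
        "hostname": hostname,
        "domain": labels[cut - 1],                 # label just before the suffix
        "suffix": ".".join(labels[cut:]),          # empty if no known suffix matched
        "subdomain": ".".join(labels[:cut - 1]),   # empty if nothing precedes the domain
    }


for sample in ["http://www.google.com", "gmail.com", "https://pandas.pydata.org"]:
    print(split_url(sample))
```

For the sample URLs used above, this sketch yields the same hostname, domain, suffix and subdomain values as the table, with an empty string where the table shows a blank subdomain.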