morpheus.parsers.url_parser
Functions
parse(urls[, req_cols])
Extract hostname, domain, subdomain and suffix from URLs.
- parse(urls, req_cols=None)
Extract hostname, domain, subdomain and suffix from URLs.
- Parameters
- urls
URLs to be parsed.
- req_cols
Selected columns to extract. Can be a subset of (hostname, domain, subdomain, suffix). Defaults to None.
- Returns
- cudf.DataFrame
Parsed DataFrame containing the selected columns.
Examples
>>> from cudf import DataFrame
>>> from morpheus.parsers import url_parser
>>>
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> url_parser.parse(input_df["url"])
            hostname  domain suffix subdomain
0     www.google.com  google    com       www
1          gmail.com   gmail    com
2         github.com  github    com
3  pandas.pydata.org  pydata    org    pandas
>>> url_parser.parse(input_df["url"], req_cols={'domain', 'suffix'})
   domain suffix
0  google    com
1   gmail    com
2  github    com
3  pydata    org
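To make the four output columns concrete, here is a minimal CPU-only sketch of how a single URL breaks down into hostname, domain, suffix and subdomain. It is not how `url_parser.parse` is implemented. The function name `split_url` is hypothetical, it uses only the Python standard library, and it assumes the suffix is always the last dot-separated label, which does not hold for multi-label suffixes such as `co.uk`.

```python
from urllib.parse import urlsplit

ALL_COLS = ("hostname", "domain", "suffix", "subdomain")


def split_url(url, req_cols=None):
    """Naively split one URL into hostname, domain, suffix and subdomain.

    Hypothetical helper for illustration only. Assumes a single-label
    suffix (e.g. "com"), so "example.co.uk" would be split incorrectly.
    """
    # urlsplit only recognizes the host when a scheme separator is present,
    # so prepend "//" for bare hosts such as "gmail.com".
    if "://" not in url:
        url = "//" + url
    hostname = urlsplit(url).hostname or ""

    labels = hostname.split(".")
    if len(labels) > 1:
        suffix, domain = labels[-1], labels[-2]
    else:
        suffix, domain = "", hostname
    # Everything to the left of the domain is treated as the subdomain.
    subdomain = ".".join(labels[:-2])

    row = {
        "hostname": hostname,
        "domain": domain,
        "suffix": suffix,
        "subdomain": subdomain,
    }
    # Mirror the req_cols idea: keep only the requested columns, if any.
    cols = ALL_COLS if req_cols is None else [c for c in ALL_COLS if c in req_cols]
    return {c: row[c] for c in cols}


if __name__ == "__main__":
    for u in ["http://www.google.com", "gmail.com", "https://pandas.pydata.org"]:
        print(split_url(u))
    print(split_url("github.com", req_cols={"domain", "suffix"}))
```

Handling real-world suffixes correctly requires a public suffix list rather than the last-label shortcut used in this sketch.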