How to Get Encoding Format of URL in Python?
As a blogger and a programmer, I often encounter the need to get the encoding format of a URL in Python. Recently, while working on a project that involved web scraping, I came across a situation where I had to decode the URL to its original format. Here's how I did it:
Using urllib.parse.urlparse()
The easiest way to start is with the urlparse() function from the urllib.parse module, which is part of the Python Standard Library. It doesn't report an encoding by itself, but it parses a URL into its component parts so you can inspect each one. Here's how to use it:
from urllib.parse import urlparse
url = "https://www.example.com/path/to/page.html?query=parameter"
parsed_url = urlparse(url)
print(parsed_url.scheme) # output: https
print(parsed_url.netloc) # output: www.example.com
print(parsed_url.path) # output: /path/to/page.html
print(parsed_url.query) # output: query=parameter
In the above code, we first import the urlparse() function from the urllib.parse module. Then, we define a URL string and pass it to urlparse(). This function returns a ParseResult object (a named tuple) that contains the parsed components of the URL. We can then access these components using the object's attributes. Note that urlparse() only splits the string; any %xx escapes in the URL are left as they are.
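To see the difference between splitting and decoding, here is a minimal sketch that pairs urlparse() with parse_qs(), also from urllib.parse. The URL and its query value are made up for illustration; parse_qs() is not used elsewhere in this post.

```python
from urllib.parse import urlparse, parse_qs

# Illustrative URL with percent-encoded characters in the query string.
url = "https://www.example.com/path/to/page.html?query=Hello%20World%21"
parsed_url = urlparse(url)

# urlparse() only splits the URL, so the escapes are still present here.
raw_query = parsed_url.query

# parse_qs() turns the query string into a dict of lists and decodes the escapes.
params = parse_qs(parsed_url.query)

print(raw_query)  # query=Hello%20World%21
print(params)     # {'query': ['Hello World!']}
```

Each value is a list because a query string may repeat the same key more than once.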
Using urllib.parse.unquote()
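If you want to decode the URL to its original format, you can use the unquote() function from the urllib.parse module. This function replaces each %xx escape with the character it represents. By default it decodes the resulting bytes as UTF-8, not just ASCII, and you can pass a different codec through the encoding argument. Here's an example: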
from urllib.parse import unquote
url = "https://www.example.com/path/to/page.html?query=Hello%20World%21"
decoded_url = unquote(url)
print(decoded_url) # output: https://www.example.com/path/to/page.html?query=Hello World!
In the above code, we first import the unquote() function from the urllib.parse module. Then, we define a URL string that contains encoded characters. We pass this URL string to the unquote() function, which returns the URL with its escapes decoded.
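If you don't know which character encoding was used for the percent-escapes, one rough approach is to decode the raw bytes with a few candidate codecs and keep the first one that works. The sketch below is only a heuristic under that assumption; the helper name guess_percent_encoding and the candidate list are hypothetical, not part of the standard library.

```python
from urllib.parse import unquote_to_bytes


def guess_percent_encoding(url, candidates=("utf-8", "latin-1")):
    """Return (codec_name, decoded_text) for the first codec that decodes cleanly.

    Hypothetical helper: this is a guess, not a reliable detector.
    """
    # Turn the %xx escapes into raw bytes without assuming any text encoding.
    raw = unquote_to_bytes(url)
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


print(guess_percent_encoding("caf%C3%A9"))  # ('utf-8', 'café')
print(guess_percent_encoding("caf%E9"))     # ('latin-1', 'café')
```

Keep in mind that latin-1 can decode any byte sequence, so it works as a catch-all here, and a URL with only ASCII characters will always be reported as utf-8.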
Using requests library
If you're working with web pages and want the encoding declared in the HTTP headers, you can use the requests library. This third-party library (installed with pip install requests) allows you to send HTTP requests from Python and inspect the response. Keep in mind that this is the encoding of the page content, not of the URL itself. Here's an example:
import requests
url = "https://www.example.com"
response = requests.get(url)
print(response.encoding) # e.g. UTF-8, depending on the server's Content-Type header
In the above code, we first import the requests library. Then, we define a URL string and send an HTTP GET request using the requests.get() function. This function returns a Response object that contains the response headers and body. Its encoding attribute holds the charset that requests read from the Content-Type header. When the header declares no charset, requests falls back to a default or leaves the attribute as None.
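To make it clearer where that value comes from, here is a small sketch that pulls the charset out of a Content-Type header value by hand, using only the standard library. The helper name charset_from_content_type is hypothetical, and the header strings are sample values, not real server output.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset from a Content-Type value, or None.

    Hypothetical helper for illustration; requests does its own parsing.
    """
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

With requests, you could pass it response.headers.get("Content-Type"). If the header gives no charset, response.apparent_encoding offers a guess based on the response body instead.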
To sum up: urlparse() splits a URL into its parts, unquote() decodes its percent-escapes, and response.encoding reports the charset a server declares for the page. Depending on your specific use case, you can choose the method that works best for you.