A malicious Python package called “onyxproxy” was discovered on PyPI, the official repository for Python packages, using Unicode as an obfuscation technique to steal and exfiltrate developers’ account credentials and other sensitive data from compromised devices.
The malicious package mixes Unicode font variants of ordinary letters (for example, bold and italic forms) in its source code to evade automated scans and defenses that identify potentially malicious functions through string matching.
Cybersecurity specialists at Phylum discovered the onyxproxy package, which contained a “setup.py” file with thousands of suspicious code strings that use a mix of Unicode characters.
The package has since been removed from PyPI, but it had amassed 183 downloads after being published on the platform on March 15.
Unicode is a comprehensive character encoding standard that unifies various character sets and encoding schemes under a common standard covering over 100,000 characters. Its goal is to maintain interoperability and consistent text representation across languages and platforms, eliminating encoding conflicts and data corruption issues.
Python accepts Unicode characters in identifiers, such as the names of variables, functions, classes, modules, and other objects. Because the interpreter normalizes identifiers (using the NFKC form) when it parses code, coders can write identifiers that look different to a string-matching scanner yet resolve to the same name. In the case of onyxproxy, the authors disguised the identifiers “import,” “subprocess,” and “CryptUnprotectData.” Each letter has several Unicode variants, so identifiers of this length can be written in a vast number of ways, easily beating string-matching-based defenses.
This Unicode support in Python can be easily abused to hide the strings that scanners look for, making code appear innocuous while it still performs malicious behavior.
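To make the mechanism concrete, here is a minimal sketch, not code taken from onyxproxy. It relies on documented Python behavior: identifiers are normalized to NFKC during parsing (PEP 3131). The helper name `to_math_bold` is hypothetical, and Mathematical Bold is just one of several font-variant ranges an attacker could choose.

```python
import unicodedata


def to_math_bold(name: str) -> str:
    """Hypothetical helper: map ASCII letters to Mathematical Bold letters."""
    out = []
    for ch in name:
        if "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))  # bold small letters
        elif "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))  # bold capital letters
        else:
            out.append(ch)  # digits, underscores, etc. are left unchanged
    return "".join(out)


disguised = to_math_bold("subprocess")

# A naive string match against the raw source text finds nothing.
print("subprocess" in disguised)  # False

# NFKC normalization recovers the plain ASCII name.
print(unicodedata.normalize("NFKC", disguised) == "subprocess")  # True

# The parser applies the same normalization to identifiers, so an
# assignment to the bold name binds the ordinary ASCII name.
namespace = {}
exec(f"{to_math_bold('value')} = 42", namespace)
print(namespace["value"])  # 42
```

The scanner and the interpreter see two different things: the scanner compares raw code points, while the interpreter compares normalized names.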
Although the obfuscation method used by the onyxproxy package isn’t particularly sophisticated, its use in the wild is worrying and may be a sign of broader abuse of Unicode for Python obfuscation.
The risks of Unicode in Python have been extensively discussed in the Python development community in the past.
In November 2021, academic researchers presented a theoretical attack called “Trojan Source” that used Unicode control characters to inject vulnerabilities into source code while making it harder for human reviewers to detect those malicious injections.
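As a rough illustration of how a reviewer might check for the kind of control characters Trojan Source relies on, the sketch below scans text for Unicode bidirectional embedding, override, and isolate characters. The function name `find_bidi_controls` and the sample input are hypothetical, and the character set covers only the explicit bidi formatting ranges U+202A to U+202E and U+2066 to U+2069, so it should not be read as a complete defense.

```python
# Explicit bidirectional formatting characters:
# U+202A..U+202E (embeddings/overrides) and U+2066..U+2069 (isolates).
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def find_bidi_controls(source: str):
    """Return (line, column, code point) for each bidi control found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits


# Hypothetical sample: a right-to-left override hidden inside a string literal.
sample = 'x = 1\nif role != "user\u202e"\n'
for lineno, col, codepoint in find_bidi_controls(sample):
    print(f"line {lineno}, column {col}: {codepoint}")
```

A check like this is cheap enough to run in a pre-commit hook or CI step, although legitimate right-to-left text in comments or strings could trigger false positives.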
Therefore, defenders must implement more robust detection mechanisms against these emerging threats to keep such attacks from succeeding again.
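One possible hardening step is to normalize text before matching it. The sketch below is an assumption about how a scanner could do that, not a description of any existing tool. It normalizes each non-ASCII word to NFKC and compares the result against a watchlist. The function name `scan_source` and the watchlist contents are hypothetical, with the two entries borrowed from the identifiers mentioned above.

```python
import re
import unicodedata

# Hypothetical watchlist; a real scanner would use a much broader rule set.
WATCHLIST = {"subprocess", "CryptUnprotectData"}


def scan_source(source: str):
    """Return (line number, normalized name) for disguised watchlist names."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in re.finditer(r"\w+", line):
            word = match.group()
            if word.isascii():
                continue  # plain ASCII is already covered by ordinary matching
            normalized = unicodedata.normalize("NFKC", word)
            if normalized in WATCHLIST:
                findings.append((lineno, normalized))
    return findings


# Build a disguised name from Mathematical Bold small letters (U+1D41A onward).
bold = "".join(chr(0x1D41A + ord(c) - ord("a")) for c in "subprocess")
sample = f"import os\nimport {bold}\n"
print(scan_source(sample))  # [(2, 'subprocess')]
```

This word-level scan also inspects comments and string literals, which makes it noisier than a token-aware scanner. NFKC normalization does not fold look-alike letters from other scripts, such as Cyrillic homoglyphs, so those would need a separate check.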