Here’s a quick method to get HGNC symbols and names that draws upon data from UCSC and the open source MyGene.info project:
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz | gunzip -c | cut -f13 | sort | uniq | get_hgnc_names_for_symbols.py > hgnc_symbols_with_names.txt
There’s a Python script in there that I call get_hgnc_names_for_symbols.py:
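#!/usr/bin/env python
import sys

from mygene import MyGeneInfo

# Read one gene symbol per line from standard input, skipping blank lines
hgnc_symbols = [line.strip() for line in sys.stdin if line.strip()]

mg = MyGeneInfo()
results = mg.querymany(hgnc_symbols, scopes='symbol', species='human', verbose=False)

for result in results:
    # Skip symbols MyGene.info could not resolve, and hits missing a symbol or name
    if result.get('notfound') or 'symbol' not in result or 'name' not in result:
        continue
    sys.stdout.write("%s\t%s\n" % (result['symbol'], result['name']))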
The pipeline above writes a two-column, tab-delimited text file called hgnc_symbols_with_names.txt that contains each HGNC symbol (e.g., AAR2) and its name (e.g., AAR2 splicing factor homolog). The file could be loaded into a lookup table or, provided it is sorted by symbol in the same order Python uses to compare strings, searched very quickly with a binary search via Python's bisect module.
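As a rough sketch of the binary-search idea, the following reads symbol/name pairs and looks a symbol up with bisect. The function names load_symbol_table and lookup_name are hypothetical helpers made up for illustration, and apart from AAR2 the sample rows are stand-ins rather than actual pipeline output. The loader re-sorts the rows in Python, on the assumption that the shell sort's locale-dependent order may not match Python's string ordering.

```python
import bisect


def load_symbol_table(lines):
    """Parse 'symbol<TAB>name' lines into two parallel lists sorted by symbol.

    Hypothetical helper: re-sorting here is a precaution, since the order
    produced by the shell `sort` depends on the locale.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        symbol, _, name = line.partition("\t")
        pairs.append((symbol, name))
    pairs.sort()
    symbols = [symbol for symbol, _ in pairs]
    names = [name for _, name in pairs]
    return symbols, names


def lookup_name(symbols, names, symbol):
    """Return the name for `symbol` using binary search, or None if absent."""
    i = bisect.bisect_left(symbols, symbol)
    if i < len(symbols) and symbols[i] == symbol:
        return names[i]
    return None


# Illustrative rows in the same two-column format the pipeline writes
sample_lines = [
    "TP53\ttumor protein p53\n",
    "AAR2\tAAR2 splicing factor homolog\n",
    "A1BG\talpha-1-B glycoprotein\n",
]
symbols, names = load_symbol_table(sample_lines)
print(lookup_name(symbols, names, "AAR2"))
print(lookup_name(symbols, names, "NOT_A_GENE"))
```

To use the real output, pass an open file handle instead of the sample list, e.g. load_symbol_table(open("hgnc_symbols_with_names.txt")). For a file of this size, a plain dict built from the same pairs would be just as practical; bisect mainly pays off when the sorted lists are already in memory and you also want range or prefix queries.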