1
votes

I am working on a project that requires me to search through pubmed using inputs from an Excel spreadsheet and print counts of the results. I have been using xlrd and entrez to do this job. Here is what I have tried.

  1. I need to search through pubmed using the name of the author, his/her medical school, a range of years, and his/her mentor's name, which are all in an Excel spreadsheet. I have used xlrd to turn each column with the required information into lists of strings.

    from xlrd import open_workbook
    book = xlrd.open_workbook("HEENT.xlsx").sheet_by_index(0)
    med_name = []
    for row in sheet.col(2):
        med_name.append(row)
    med_school = []
    for row in sheet.col(3):
        med_school.append(row)
    mentor = []
    for row in sheet.col(9):
        mentor.append(row)
    
  2. I have managed to print the counts of my specific queries using Entrez.

    from Bio import Entrez
    Entrez.email = "[email protected]"
    handle = Entrez.egquery(term="Jennifer Runch AND ((2012[Date - Publication] : 2017[Date - Publication])) ")
    handle_1 = Entrez.egquery(term = "Jennifer Runch AND ((2012[Date - Publication] : 2017[Date - Publication])) AND Leoard P. Byk")
    handle_2 = Entrez.egquery(term = "Jennifer Runch AND ((2012[Date - Publication] : 2017[Date - Publication])) AND Southern Illinois University School of Medicine")
    record = Entrez.read(handle)
    record_1 = Entrez.read(handle_1)
    record_2 = Entrez.read(handle_2)
    pubmed_count = []
    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count.append(row["Count"])
    
    for row in record_1["eGQueryResult"]:
        if row["DbName"] == "pubmed":
             pubmed_count.append(row["Count"])
    
    for row in record_2["eGQueryResult"]:
         if row["DbName"] == "pubmed":
             pubmed_count.append(row["Count"])
    print(pubmed_count)   
    >>>['3', '0', '0']
    

    The problem is that I need to replace the student name ("Jennifer Runch") with the next student name in the list of student names("med_name"), the medical school with the next school, and the current mentor's name with the next mentor's name from the list.

I think I should write a for loop after declaring my email to pubmed, but I am not sure how to link the two blocks of code together. Does anyone know of an efficient way to connect the two blocks of code or know how to do this with a more efficient way than the one I have tried? Thank you!

1
Is the question still open?Maximilian Peters
@MaximilianPeters I have some follow up question on this one, which I posted to another thread: stackoverflow.com/questions/40161460/…Molly Zhao

1 Answers

1
votes

You got most of the code in place. It just needed to be modified slightly.

Assuming your table looks like this:

Jennifer Bunch  |Southern Illinois University School of Medicine|Leonard P. Rybak
Philipp Robinson|Stanford University School of Medicine         |Roger Kornberg

you could use the following code

import xlrd
from Bio import Entrez
sheet = xlrd.open_workbook("HEENT.xlsx").sheet_by_index(0)

med_name = list()
med_school = list()
mentor = list()
search_terms = list()
for row in range(0, sheet.nrows):
    search_terms.append([sheet.cell_value(row, 0), sheet.cell_value(row,1), sheet.cell_value(row, 2)])

pubmed_counts = list()

for search_term in search_terms:
    handle = Entrez.egquery(term="{0} AND ((2012[Date - Publication] : 2017[Date - Publication])) ".format(search_term[0]))
    handle_1 = Entrez.egquery(term = "{0} AND ((2012[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[2]))
    handle_2 = Entrez.egquery(term = "{0} AND ((2012[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[1]))
    record = Entrez.read(handle)
    record_1 = Entrez.read(handle_1)
    record_2 = Entrez.read(handle_2)

    pubmed_count = ['', '', '']

    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[0] = row["Count"]
    for row in record_1["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[1] = row["Count"]
    for row in record_2["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[2] = row["Count"]

    print(pubmed_count)
    pubmed_counts.append(pubmed_count)

Output


['3', '0', '0']
['1', '0', '0']

The required modification is to make the queries variable using format.

Some other modifications which are not necessary but might be helpful:

  • loop over the Excel sheet only once
  • store the pubmed_count in a predefined list because if values come back empty, the size of the output will vary making it hard to guess which value belongs to which query
  • everything could be even further optimized and prettified, e.g. store the queries in a list and loop over them which would give less code repetition but now it does the job.