How to Extract Email Addresses from Resumes


Processing a large number of resumes is a challenge for any department that is hiring. With the right tools at hand, though, very little is difficult to achieve, especially for the MarTech team at Zyxware. Here is a note on how we extracted the required information from PDF files to process the job applications we received for our digital marketing executive post.

If you have not seen my previous post on extracting candidate details from your LinkedIn job post, please read that first. It will give you the full context.

The information I extracted from the LinkedIn emails does not include the candidate's email address, which is essential for getting in touch. Fortunately, the data includes a link to each resume, and the resumes contain the email address.

I downloaded the resumes, but they came in different formats: some in DOC, some in DOCX, and some in PDF.

To get everything as PDF, I used the following commands provided by LibreOffice.

lowriter --headless --convert-to pdf *.doc

And

lowriter --headless --convert-to pdf *.docx
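If you prefer to drive the conversion from a script, the two commands above can be wrapped in Python. This is only a minimal sketch: the function names are my own for illustration, it assumes `lowriter` is on your PATH, and it uses LibreOffice's `--outdir` option to choose where the PDFs are written.

```python
import subprocess
from pathlib import Path


def build_convert_command(files, out_dir="."):
    """Build the LibreOffice command line that converts the given files to PDF."""
    return ["lowriter", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), *[str(f) for f in files]]


def convert_resumes(src_dir=".", out_dir="."):
    """Convert every .doc and .docx file in src_dir and return the files handled."""
    src = Path(src_dir)
    files = sorted(p for p in src.iterdir()
                   if p.suffix.lower() in {".doc", ".docx"})
    if files:
        # check=True raises CalledProcessError if LibreOffice exits with an error.
        subprocess.run(build_convert_command(files, out_dir), check=True)
    return files


# Show the command that would be run for two hypothetical file names.
print(build_convert_command(["cv_a.doc", "cv_b.docx"]))
```

Calling `convert_resumes()` in the folder that holds the downloads handles both extensions in one pass, so the two shell commands collapse into a single call.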

There are tools that can read PDFs directly, but I chose to convert the PDFs to text and then search the text for email addresses.

The pdf2txt.py utility converts PDF files to text documents. A small bash script was required to run it over all the files.

mkdir -p txt   # the output directory must exist before pdf2txt.py writes into it
for i in *.pdf
do
  echo "$i"
  pdf2txt.py "$i" --outfile "txt/${i%.pdf}.txt"
done

This converts all my PDF files to text and stores the results in the txt directory.
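The same loop can also be written in Python. The sketch below is an alternative, not what I actually ran: it assumes the pdfminer.six package is installed (it provides `extract_text` in `pdfminer.high_level`), and the helper names `txt_path_for` and `pdfs_to_text` are made up for this example.

```python
from pathlib import Path


def txt_path_for(pdf_path, out_dir="txt"):
    """Map resume.pdf to <out_dir>/resume.txt, like ${i%.pdf}.txt in the bash loop."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def pdfs_to_text(src_dir=".", out_dir="txt"):
    """Write one text file per PDF found in src_dir and return the paths written."""
    # Imported here so the path helper above works even without pdfminer.six.
    from pdfminer.high_level import extract_text  # assumption: pdfminer.six is installed

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        target = txt_path_for(pdf, out_dir)
        target.write_text(extract_text(str(pdf)), encoding="utf-8")
        written.append(target)
    return written


# Show where the text for a hypothetical resume would be stored.
print(txt_path_for("Jane Doe_resume.pdf"))
```

Creating the output directory inside the function avoids the failure you would otherwise get when the txt directory does not exist yet.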

The final step was to extract the required details from the text files. I wrote a simple Python script that goes through each text file and searches it with an email regular expression to pick out the available email addresses.

import re, csv, glob

csv_columns = ['filename', 'name', 'email']
# Raw string, so the backslash in "\." reaches the regex engine unchanged.
regex = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

# Run this from inside the txt directory, where the text files were stored.
with open("emails.csv", 'w', newline='') as out:
  writer = csv.DictWriter(out, fieldnames=csv_columns)
  writer.writeheader()
  for filename in glob.glob("*.txt"):
    with open(filename, 'r') as f: # open in read-only mode
      s = f.read().lower()
    emails = regex.findall(s)
    # Build a fresh row for every file, so an address found in one resume
    # is never carried over to the next resume that has none.
    data = {
      "filename": filename,
      "name": filename.split('_')[0], # In my case, the first part of the file name is the full name of the candidate.
      "email": emails[0] if emails else "",
    }
    writer.writerow(data)
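To see what the regular expression in the script above actually picks up, here is a small standalone sketch on made-up sample text. The `find_emails` helper and the sample addresses are hypothetical. One thing worth knowing: the last part of the pattern also accepts "." and "-", so a full stop at the end of a sentence can end up attached to the match; the sketch strips it.

```python
import re

# Same pattern as in the script above, written as a raw string.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")


def find_emails(text):
    """Return the unique email-like strings in text, in order of appearance."""
    found = []
    for match in EMAIL_RE.findall(text.lower()):
        # Drop a trailing "." or "-" that belongs to the sentence, not the address.
        cleaned = match.rstrip(".-")
        if cleaned not in found:
            found.append(cleaned)
    return found


sample = "Jane Doe\nEmail: Jane.Doe@Example.com.\nAlt: jd+jobs@mail.example.org"
print(find_emails(sample))
# → ['jane.doe@example.com', 'jd+jobs@mail.example.org']
```

Returning every match instead of only the first one makes it easier to spot resumes that list more than one address, for example a referee's email next to the candidate's own.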

Now we are all set to send the first assignment to the candidates who applied for this position.

If you are looking for similar automation support for your marketing team, do contact us. If you are interested in joining our team, please connect with me on LinkedIn and we can discuss.

A big thanks to the Free Software community for providing the tools I use every day, which make our work life as productive as possible.