[Translation] Python - an assistant in finding cheap air tickets for those who love to travel


The author of the article we are translating today says that its goal is to describe the development of a web scraper in Python, using Selenium, that searches for airfares. The search uses flexible dates (±3 days relative to the specified dates). The scraper saves the results in an Excel file and emails the person who launched it a summary of what it found. The goal of the project is to help travelers find the best deals.



If you feel lost while working through this material, take a look at this article.

What are we going to look for?


You are free to use the system described here however you like. For example, I have used it to search for weekend tours and for tickets to my hometown. If you are serious about finding cheap tickets, you can run the script on a server (a simple server, for about 130 rubles a month, is quite suitable) and set it up to run once or twice a day. The search results will be emailed to you. In addition, I recommend configuring the script to save the Excel file with the results to a Dropbox folder, so you can view the files from anywhere at any time.


I haven't found any error fares yet, but I believe it is possible

As already mentioned, the search uses "flexible dates": the script finds offers that are within three days of the given dates. Although a single run of the script searches for offers on only one route, it is easy to modify it to collect data for several routes. You can even use it to hunt for error fares; such finds can be very interesting.

Why do we need another web scraper?


Frankly, when I first started web scraping, it did not particularly interest me. I wanted to work on projects in predictive modeling, financial analysis, and perhaps sentiment analysis of texts. But it turned out to be very interesting to figure out how to build a program that collects data from websites. As I dug into the topic, I realized that web scraping is the "engine" of the Internet.

You may decide that this is too bold a claim. But consider that Google began with a web scraper that Larry Page created using Java and Python. Google's robots have been crawling the Internet ever since, trying to give users the best answers to their questions. Web scraping has an endless number of applications, and even if you are interested in some other area of Data Science, you will need some scraping skills to obtain data for analysis.

I found some of the techniques used here in an excellent book on web scraping that I recently acquired. It contains many simple examples and ideas on how to apply what you learn in practice. There is also a very interesting chapter on bypassing reCaptcha checks. That was news to me: I didn't know there were special tools and even entire services for solving such problems.

Love to travel?!


The simple and fairly innocuous question in this section's heading is usually answered in the affirmative, accompanied by a couple of stories from the travels of the person being asked. Most of us would agree that traveling is a great way to immerse yourself in new cultural environments and broaden your horizons. However, if you ask someone whether they like searching for flights, I am sure the answer will be far less positive. This is where Python comes to the rescue.

The first task on the way to building a system for finding airfare information is choosing a suitable platform to take the data from. This was not an easy decision for me, but in the end I settled on the Kayak service. I tried Momondo, Skyscanner, Expedia, and a few others, but the anti-bot protection on those resources was impenetrable. After several attempts, during which I had to deal with traffic lights, pedestrian crossings, and bicycles while trying to convince the systems that I was human, I decided that Kayak suited me best, even though it too starts running checks if too many pages are loaded in a short time. I managed to have the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties occasionally arise when working with Kayak, but if the checks start pestering you, you either need to deal with them manually and then restart the bot, or wait a few hours until the checks stop. If necessary, you can easily adapt the code for another platform, and if you do, please report it in the comments.

If you are just getting started with web scraping and don't know why some websites fight against it, then before embarking on your first project in this area, do yourself a favor and search Google for material on "web scraping etiquette". Your experiments may end sooner than you think if you scrape unreasonably.
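One simple piece of such etiquette is checking a site's robots.txt before scraping it. Here is a minimal sketch using Python's standard library; the URL and path are placeholders for illustration only:

  from urllib.robotparser import RobotFileParser

  # Parse the site's robots.txt to see whether a given path may be fetched
  robots = RobotFileParser()
  robots.set_url('https://www.example.com/robots.txt')  # placeholder URL
  robots.read()

  if robots.can_fetch('*', 'https://www.example.com/flights'):
      print('robots.txt allows scraping this path')
  else:
      print('robots.txt asks us not to scrape this path')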

Getting Started


Here is a general overview of what will happen in the code of our web scraper:

  • Import the required libraries.
  • Open a Google Chrome tab.
  • Call the function that launches the bot, passing it the cities and dates to be used when searching for tickets.
  • This function gets the first search results, sorted by greatest appeal (best), and clicks the button to load additional results.
  • Another function collects data from the whole page and returns a data frame.
  • The two previous steps are repeated for results sorted by ticket price (cheap) and by flight duration (fastest).
  • An email is sent to the script's user containing a brief summary of ticket prices (the cheapest ticket and the average price), and the data frame with the information sorted by the three indicators mentioned above is saved as an Excel file.
  • All of the above is performed in a loop, repeated at a specified time interval.

It should be noted that every Selenium project starts with a web driver. I use Chromedriver and work with Google Chrome, but there are other options: PhantomJS and Firefox are also popular. After downloading the driver, place it in the appropriate folder; that completes the preparation for using it. The first lines of our script open a new Chrome tab.

Keep in mind that in this story I am not trying to open up new horizons in the search for lucrative flight deals. There are far more advanced ways of searching for such offers. I just want to offer the readers of this material a simple but practical way to approach the problem.

Here is the code we talked about above.

  from time import sleep, strftime
  from random import randint
  import pandas as pd
  from selenium import webdriver
  from selenium.webdriver.common.keys import Keys
  import smtplib
  from email.mime.multipart import MIMEMultipart

  # Use your path to chromedriver here!
  chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

  driver = webdriver.Chrome(executable_path=chromedriver_path)  # This command opens the Chrome window
  sleep(2)

At the beginning of the code you can see the package imports used throughout the project. For example, randint is used to make the bot "fall asleep" for a random number of seconds before starting a new search operation; usually no bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with the sites.

Let's do a small experiment and open the kayak.com website in a separate window. We will choose the city we are flying from and the city we want to go to, as well as the flight dates. When choosing dates, make sure you use the ±3 day range. I wrote the code with the page the site returns for such requests in mind. If, for example, you need to search for tickets only on exact dates, you will most likely have to tweak the bot's code. As I walk through the code I explain what it does, but if something is confusing, let me know.

Now click the button to start the search and look at the link in the address bar. It should be similar to the link I use in the example below, where the kayak variable that stores the URL is declared and the web driver's get method is used. After you click the search button, the results should appear on the page.
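A minimal sketch of what that looks like (the city codes and dates here are just placeholders; the full version appears later in the start_kayak function):

  # Build the flexible-dates search URL and open it in the browser controlled by Selenium
  kayak = ('https://www.kayak.com/flights/' + 'LIS' + '-' + 'SIN' +
           '/' + '2019-08-21' + '-flexible/' + '2019-09-07' + '-flexible?sort=bestflight_a')
  driver.get(kayak)
  sleep(randint(8, 10))  # give the page time to load before doing anything else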


When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha check. You can pass this test manually and continue experimenting until the system decides to run a new one. When I tested the script, it felt as if the first search session always ran without problems, so if you want to experiment with the code, you will only have to pass checks manually now and then, and let the code run with long intervals between search sessions. And if you think about it, a person is unlikely to need ticket price information collected at 10-minute intervals anyway.

Working with a page using XPath


So, we have opened the window and loaded the site. To get the prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and did not feel the need for CSS selectors, but it is quite possible to work that way too. Navigating a page with XPath can be tricky, and even if you use the methods described in this article, where relevant identifiers are copied from the page source, I realized that this is actually not the best way to refer to the necessary elements. By the way, this book contains an excellent description of the basics of working with pages using XPath and CSS selectors. Here's what the corresponding web driver method looks like.
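The original article shows this as a screenshot; a minimal sketch of the call, using the same kind of XPath expression that appears later in page_scrape, might look like this:

  # Find every element on the page matching an XPath expression and extract its visible text
  xp_sections = '//*[@class = "section duration"]'
  sections = driver.find_elements_by_xpath(xp_sections)
  sections_list = [value.text for value in sections]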


So, we continue working on the bot. Let's use the program to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view an element's code, right-click on the part of the page you are interested in and choose Inspect from the menu that appears. You can do this for different elements of the page; their code will be displayed and highlighted in the code view window.

View Page Code

To confirm my reasoning about the shortcomings of copying selectors from the code, consider the following.

Here’s what happens when you copy the code:

 //*[@id = "wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

To copy something like this, right-click on the section of code you are interested in and select Copy > Copy XPath.

Here is what I used to define the Cheapest button:

  cheap_results = '//a[@data-code = "price"]'


Copy > Copy XPath

Obviously, the second option looks much simpler. It searches for an a element whose data-code attribute equals price. The first option searches for an element whose id is wtKI-price_aTab, and then follows the XPath path /div[1]/div/div/div[1]/div/span/span to the element. Such an XPath query will do its job, but only once: I can say right now that the id will change the next time the page loads. The wtKI character sequence changes dynamically on every page load, so code that relies on it becomes useless after the next reload. So take some time to figure out XPath; this knowledge will serve you well.
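For reference, here is roughly how such a selector is used further on (the same call appears later in the start_kayak function):

  # Click the "cheapest" sorting tab using the attribute-based selector
  cheap_results = '//a[@data-code = "price"]'
  driver.find_element_by_xpath(cheap_results).click()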

However, it should be noted that copying XPath selectors can be useful when working with fairly simple sites, and if this suits you, there is nothing wrong with that.

Now let's think about what to do if you want to gather all the search results into a list. Very simple: each result lives inside an object with the resultWrapper class. All the results can be collected in a loop that resembles the one shown below.

It should be noted that if you understand the above, you should have no trouble understanding most of the code we will be walking through. This code accesses what we need (the element the result is wrapped in) using a path-specifying mechanism (XPath), takes the element's text, and puts it into an object the data can be read from (first flight_containers is used, then flights_list).
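The original shows this loop as a screenshot; a minimal sketch of it, assuming the resultWrapper class mentioned above, could look like this:

  # Collect every search result (each one is wrapped in an element with class "resultWrapper")
  flight_containers = driver.find_elements_by_xpath('//div[@class = "resultWrapper"]')
  flights_list = [flight.text for flight in flight_containers]

  # Print the first three results to make sure we are getting what we expect
  for flight in flights_list[:3]:
      print(flight)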


The first three results are printed, and we can clearly see everything we need. However, we have more interesting ways of obtaining the information: we need to extract the data from each element separately.

Get to work!


The easiest function to write is the one that loads additional results, so let's start with it. I want to maximize the number of flights the program collects information about without arousing the service's suspicions and triggering a check, so I click the Load more results button once each time a page is displayed. Pay attention to the try block, which I added because the button sometimes does not load properly. If you run into this too, comment out the calls to this function in the start_kayak function, which we will discuss below.

  # Load more results in order to maximize the amount of data collected
  def load_more():
      try:
          more_results = '//a[@class = "moreButton"]'
          driver.find_element_by_xpath(more_results).click()
          # Printing these notes while the program runs helps me quickly see what it is doing
          print('sleeping.....')
          sleep(randint(45, 60))
      except:
          pass

Now, after this long analysis of the function (sometimes I get carried away), we are ready to declare the function that will do the actual scraping of the page.

Most of what I need is collected in the following function, called page_scrape. Sometimes the returned data about the legs of the trip come combined, so I use a simple method to separate them; for example, this is what the variables section_a_list and section_b_list are first used for. The function returns the data frame flights_df, which lets us keep the results obtained with different sorting methods separate and combine them later.

  def page_scrape():
      """This function takes care of the scraping part"""

      xp_sections = '//*[@class = "section duration"]'
      sections = driver.find_elements_by_xpath(xp_sections)
      sections_list = [value.text for value in sections]
      section_a_list = sections_list[::2]   # this is how we split the information about the two legs
      section_b_list = sections_list[1::2]

      # If you run into a reCaptcha, you may need to do something about it.
      # You will know something went wrong if the lists above are empty;
      # this if statement lets you shut the program down or do something else.
      # You could pause here, which would let you pass the check and continue scraping.
      # I use SystemExit because I want to test everything from the very beginning.
      if section_a_list == []:
          raise SystemExit

      # I will use the letter a for outbound flights and b for return flights
      a_duration = []
      a_section_names = []
      for n in section_a_list:
          # the first two words are the duration, the following ones are the cities
          a_section_names.append(''.join(n.split()[2:5]))
          a_duration.append(''.join(n.split()[0:2]))
      b_duration = []
      b_section_names = []
      for n in section_b_list:
          # the first two words are the duration, the following ones are the cities
          b_section_names.append(''.join(n.split()[2:5]))
          b_duration.append(''.join(n.split()[0:2]))

      xp_dates = '//div[@class = "section date"]'
      dates = driver.find_elements_by_xpath(xp_dates)
      dates_list = [value.text for value in dates]
      a_date_list = dates_list[::2]
      b_date_list = dates_list[1::2]
      # separate the day from the day of the week
      a_day = [value.split()[0] for value in a_date_list]
      a_weekday = [value.split()[1] for value in a_date_list]
      b_day = [value.split()[0] for value in b_date_list]
      b_weekday = [value.split()[1] for value in b_date_list]

      # Get the prices
      xp_prices = '//a[@class = "booking-link"]/span[@class = "price option-text"]'
      prices = driver.find_elements_by_xpath(xp_prices)
      prices_list = [price.text.replace('$', '') for price in prices if price.text != '']
      prices_list = list(map(int, prices_list))

      # stops is one big list where the first leg of the trip sits at even indexes and the second at odd ones
      xp_stops = '//div[@class = "section stops"]/div[1]'
      stops = driver.find_elements_by_xpath(xp_stops)
      stops_list = [stop.text[0].replace('n', '0') for stop in stops]
      a_stop_list = stops_list[::2]
      b_stop_list = stops_list[1::2]

      xp_stops_cities = '//div[@class = "section stops"]/div[2]'
      stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
      stops_cities_list = [stop.text for stop in stops_cities]
      a_stop_name_list = stops_cities_list[::2]
      b_stop_name_list = stops_cities_list[1::2]

      # carrier (airline) information and departure/arrival times for both legs
      xp_schedule = '//div[@class = "section times"]'
      schedules = driver.find_elements_by_xpath(xp_schedule)
      hours_list = []
      carrier_list = []
      for schedule in schedules:
          hours_list.append(schedule.text.split('\n')[0])
          carrier_list.append(schedule.text.split('\n')[1])
      # split the times and carriers between legs a and b
      a_hours = hours_list[::2]
      a_carrier = carrier_list[::2]
      b_hours = hours_list[1::2]
      b_carrier = carrier_list[1::2]

      cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
               'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
               'Price'])

      flights_df = pd.DataFrame({'Out Day': a_day,
                                 'Out Weekday': a_weekday,
                                 'Out Duration': a_duration,
                                 'Out Cities': a_section_names,
                                 'Return Day': b_day,
                                 'Return Weekday': b_weekday,
                                 'Return Duration': b_duration,
                                 'Return Cities': b_section_names,
                                 'Out Stops': a_stop_list,
                                 'Out Stop Cities': a_stop_name_list,
                                 'Return Stops': b_stop_list,
                                 'Return Stop Cities': b_stop_name_list,
                                 'Out Time': a_hours,
                                 'Out Airline': a_carrier,
                                 'Return Time': b_hours,
                                 'Return Airline': b_carrier,
                                 'Price': prices_list})[cols]

      flights_df['timestamp'] = strftime("%Y%m%d-%H%M")  # data collection time
      return flights_df

I tried to name the variables so that the code is understandable. Remember that variables starting with a refer to the first leg of the trip, and those starting with b to the second. Let's move on to the next function.

Auxiliary mechanisms


We now have a function for loading additional search results and a function for processing those results. The article could have ended here, since these two functions provide everything you need to scrape pages that you open yourself. But we have not yet covered some of the auxiliary mechanisms mentioned above, for example the code for sending emails and a few other things. All of this lives in the start_kayak function, which we will now look at.

This function takes the cities and dates as input. Using them, it builds the link in the kayak variable, which is used to open the page with the search results sorted by best match to the query. After the first scraping session, we work with the price table at the top of the page: we find the minimum ticket price and the average price. All of this, together with the prediction given by the site, is sent by email. The corresponding table should be in the upper left corner of the page. Working with this table, by the way, can cause an error when searching with exact dates, since in that case the table is not displayed on the page.

  def start_kayak(city_from, city_to, date_start, date_end):
      """City codes - it's the IATA codes!
      Date format - YYYY-MM-DD"""

      kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
               '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
      driver.get(kayak)
      sleep(randint(8, 10))

      # Sometimes a pop-up window appears; the try block checks for it and closes it
      try:
          xp_popup_close = '//button[contains(@id, "dialog-close") and contains(@class, "Button-No-Standard-Style close")]'
          driver.find_elements_by_xpath(xp_popup_close)[5].click()
      except Exception as e:
          pass
      sleep(randint(60, 95))
      print('loading more.....')

      # load_more()

      print('starting first scrape.....')
      df_flights_best = page_scrape()
      df_flights_best['sort'] = 'best'
      sleep(randint(60, 80))

      # Take the lowest price from the table at the top of the page
      matrix = driver.find_elements_by_xpath('//*[contains(@id, "FlexMatrixCell")]')
      matrix_prices = [price.text.replace('$', '') for price in matrix]
      matrix_prices = list(map(int, matrix_prices))
      matrix_min = min(matrix_prices)
      matrix_avg = sum(matrix_prices) / len(matrix_prices)

      print('switching to cheapest results.....')
      cheap_results = '//a[@data-code = "price"]'
      driver.find_element_by_xpath(cheap_results).click()
      sleep(randint(60, 90))
      print('loading more.....')

      # load_more()

      print('starting second scrape.....')
      df_flights_cheap = page_scrape()
      df_flights_cheap['sort'] = 'cheap'
      sleep(randint(60, 80))

      print('switching to quickest results.....')
      quick_results = '//a[@data-code = "duration"]'
      driver.find_element_by_xpath(quick_results).click()
      sleep(randint(60, 90))
      print('loading more.....')

      # load_more()

      print('starting third scrape.....')
      df_flights_fast = page_scrape()
      df_flights_fast['sort'] = 'fast'
      sleep(randint(60, 80))

      # Save the combined frame to an Excel file whose name reflects the cities and dates
      final_df = df_flights_cheap.append(df_flights_best).append(df_flights_fast)
      final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                     city_from, city_to,
                                                                                     date_start, date_end), index=False)
      print('saved df.....')

      # You can follow how the prediction issued by the site correlates with reality
      xp_loading = '//div[contains(@id, "advice")]'
      loading = driver.find_element_by_xpath(xp_loading).text
      xp_prediction = '//span[@class = "info-text"]'
      prediction = driver.find_element_by_xpath(xp_prediction).text
      print(loading + '\n' + prediction)

      # Sometimes this line appears in the loading variable and later breaks the email;
      # if that happens, change it to "Not sure"
      weird = '¯\\_(ツ)_/¯'
      if loading == weird:
          loading = 'Not sure'

      username = 'YOUREMAIL@hotmail.com'
      password = 'YOUR PASSWORD'

      server = smtplib.SMTP('smtp.outlook.com', 587)
      server.ehlo()
      server.starttls()
      server.login(username, password)
      msg = ('Subject: Flight Scraper\n\n'
             'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'.format(
                 matrix_min, matrix_avg, (loading + '\n' + prediction)))
      message = MIMEMultipart()
      message['From'] = 'YOUREMAIL@hotmail.com'
      message['to'] = 'YOUROTHEREMAIL@domain.com'
      server.sendmail('YOUREMAIL@hotmail.com', 'YOUROTHEREMAIL@domain.com', msg)
      print('sent email.....')

I tested this script with an Outlook account (hotmail.com). I did not check whether it works correctly with a Gmail account; that mail system is very popular, but there are plenty of possible options. If you use a Hotmail account, you only need to enter your credentials in the code for it to work.
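If you want to try Gmail instead, a minimal sketch would only swap the SMTP server; note that Gmail normally expects an app password (or SMTP access explicitly enabled for the account) rather than your regular password:

  # Hypothetical Gmail variant: same flow, different SMTP host
  username = 'YOUREMAIL@gmail.com'
  password = 'YOUR APP PASSWORD'  # an app password generated in your Google account settings

  server = smtplib.SMTP('smtp.gmail.com', 587)
  server.ehlo()
  server.starttls()
  server.login(username, password)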

If you want to figure out exactly what certain parts of this function do, copy them out and experiment with them. Experimenting with code is the only way to truly understand it.

The Finished System


Now that everything described above is done, we can create a simple loop that calls our functions. The script prompts the user for the cities and dates. While testing, when you restart the script over and over, you are unlikely to want to enter this data manually every time, so for the duration of testing the corresponding lines can be commented out and the ones below them uncommented, in which the data the script needs is hard-coded.

  city_from = input('From which city? ')
  city_to = input('Where to? ')
  date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
  date_end = input('Return when? Please use YYYY-MM-DD format only ')

  # city_from = 'LIS'
  # city_to = 'SIN'
  # date_start = '2019-08-21'
  # date_end = '2019-09-07'

  for n in range(0, 5):
      start_kayak(city_from, city_to, date_start, date_end)
      print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))

      # Wait 4 hours before the next search
      sleep(60 * 60 * 4)
      print('sleep finished.....')

Here’s what the test run of the script looks like.

A test run of the script

Summary


If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something similar to receive results from multiple servers at once. There is also the recurring problem of the site checking whether the user is a person, but this can be dealt with. In any case, you now have a base you can expand on if you wish. For example, you could have the Excel file sent to the user as an email attachment.
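As a sketch of that last idea, here is one way the email-sending part could be extended to attach the saved Excel file; the send_with_attachment helper and the filename argument are hypothetical additions, with filename standing in for the path produced by to_excel above:

  from email.mime.multipart import MIMEMultipart
  from email.mime.base import MIMEBase
  from email import encoders
  import smtplib

  def send_with_attachment(filename, username, password, recipient):
      # Build a multipart message and attach the Excel file produced by the scraper
      message = MIMEMultipart()
      message['From'] = username
      message['To'] = recipient
      message['Subject'] = 'Flight Scraper results'

      part = MIMEBase('application', 'octet-stream')
      with open(filename, 'rb') as f:
          part.set_payload(f.read())
      encoders.encode_base64(part)
      part.add_header('Content-Disposition', 'attachment', filename=filename)
      message.attach(part)

      server = smtplib.SMTP('smtp.outlook.com', 587)
      server.ehlo()
      server.starttls()
      server.login(username, password)
      server.sendmail(username, recipient, message.as_string())
      server.quit()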


Source text: [Translation] Python - an assistant in finding cheap air tickets for those who love to travel