Finding Multi-Stop Flights Using Neo4j & Python š Part #1
As youād probably figured out by now, traveling is a big passion of Markus and mine. However, as students, our travel budget has some serious constraints.
Thatās why I started to learn about some great and fun ways to save big bucks while still seeing the world. Iād learned to master the Miles and Points game, Fuel Dumping and Hidden City Ticketing.
For a while, Iād been doing my research manually with tools like ITA Matrix or a standard web browser on sites like Momondo or Kayak. Which is frankly speaking quite tedious and time-consuming. Itās okay if you have an itinerary with only a few stops, but not very convenient if itās anything bigger or even an around the world itinerary.
My idea
I had the idea of creating a web app that allows you to enter a few destinations and your timeframe. The application should combine those airports in a very cost efficient manner and yield a list of possible flights. I wanted the web app to be able to also include hidden city tickets and multi-stop itineraries with mixed tickets (like mixing a RyanAir with a Delta Ticket). Another thing I wanted to try out for almost forever is Neo4j; a Graph Database. The hype about Base-Databases also left its marks on me. So I was desperately looking for a way where I could use a graph database⦠And the rest is history.
My planned Technology Stack
I havenāt really settled on my whole technology stack. As mentioned earlier, Iām currently using Neo4j as Database. The reason why Iād settled on a graph database in comparison to a relational database is thatās far easier to check for the shortest path and also much faster. For populating the database I am currently using a python script. I havenāt really settled on the frontend nor the middleware (like a tiny API)ā¦
Airline Routes
āAirplane, Airplane sorry I am lateā¦š¶ā. OK, I gonna stop now. To get a better understanding of the route network - offered by airlines from around the world- I started off with visualizing the airline routes in Tableau. I got the data from Openflights.org, a platform which lets you track flights youād taken and has some neat statistics like your northern most airport or your longest flight. Openflights.org publishes a list of Routes as open data which can easily be imported into Tableau. Below you can see a complete visualization of the routes on Openflights. Pretty impressive, Ayh?
Looking for something leaner? Here is a route network of Alaska Airlines (AS) and Virgin America (VX).
And hereās one of Lufthansaās pretty impressive network.
The Lufthansa Network brings me to a point worth mentioning. The data from openflights.org is sometimes not the best. Some flights have been entered incorrectly. Just look at those random flights in South America and Africa. As far as I am aware Lufthansa doesnāt have that man 5th Freedom flights.
In total itās a bit more than 36000 point-to-point connections if you donāt take the operating airline or code-share flights into account, just the fact that a connection exists between two airports.
You may wonder why I need those flights at all? My plan was to do an initial search of all possible combinations (the ones from Openflights) and add those to my neo4j database. Iāll later consult the neo4j database for a first best guess and query the exact prices in real time through an API.
A first glimpse on the schema of the Neo4j database
I have to admit when I started with this project I had absolutely no clue about how this DB, the query language, and concept works. After playing around for a while I soon felt more comfortable.
Coming up with a bulletproof database schematic is just as important as with a traditional relational database. In my first attempt, I entered all the flights as constraints and added the date, price as properties. While Iād worked for a few flights it soon got a mess. Itās also against the codex of neo4j to have hundreds of relationships between two nodes.
Thatās why I went back to the scratchboard and did some further research. I actually found a pretty cool site by Philippe Khin, in his posts heād described the pros and cons of various models. I decided to pick up the idea of having separate nodes for months, days, airports and flights. That way I can fairly easily find out if there is a flight from a specific date and see where itās going. Also, the runtime of the query is pretty impressive. Below you can see what it looks like, once some data has been added to the database.
As far the properties go I decide to take a slightly different approach than Phillipe.
My Airports have the following properties:
ā
{
ānameā: āAtmautluak Airportā,
ācountryā: āUSā,
ācodeā: āATTā,
ālngā: ā-162.272995ā,
ālatā: ā60.866699ā
}
While the flight nodes are fairly simple
ā
{
āmonthā: 6,
ādayā: 1,
āflight_numberā: āF8310ā
}
The actually interesting properties are assigned to the TO-relationships
ā
{
ādurationā: 2460,
āarrival_timeā: 1527887460,
ādistanceā: 287.54,
āmonthā: 6,
āpriceā: 37,
ādayā: 1,
ādeparture_timeā: 1527885000
}
It contains the arrival and departure time formatted as Unix timestamp and most importantly the price in EUR. Iāve also added the distance as a property, that way I can fairly easy search for long-haul flights which have a good price per distance ratio.
The query Iād executed above gives back all possible combinations between two specified airports Whitehorse (YXY)
and Ningbo (NGB)
on any day in June. As you can see, you could connect in Vancouver (YVR)
and Shenyang (SHE)
and fly directly from there to NGB. Or you could take a little detour through Chengdu (CTU)
and Hainan (HAK)
. The red circles represent the specific flight numbers. This is just a visualization of one specific trip, though there many more possible combinations.
Thatās it for this post. In the next post, Iāll be talking about which flight search APIs are freely available and how I imported the data into neo4j using a python script.
Cheers Alex