Is there an API to access Matrix? I was under the impression that there was not — indeed, that Matrix was built on top of ITA's commercial QPX API.
Seems so. Have a look with Firebug or Chrome's Developer Tools at what the browser receives before the matrix results are rendered.
I've been using Burp (http://portswigger.net/burp/proxy.html), which is a really nice tool for looking at this kind of stuff.
As HaveMilesWillTravel suggested, set up something to monitor your traffic as you use the matrix (e.g. the Live HTTP Headers extension for Firefox). Notice what gets sent when you click search, and what comes back right before the results load. Most of the JSON is fairly self-documenting, so at this point it's just a matter of picking your favorite library to send and parse messages.
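For the parsing half, a minimal sketch in Python might look like the following. The payload and its field names ("solutions", "price", "carriers") are made up for illustration and are not taken from any real site's responses; substitute whatever you actually see in your traffic capture.

```python
import json

# Hypothetical captured response body. The structure and field names are
# assumptions for illustration only, not the format of any real site.
captured = """
{"solutions": [
  {"price": "EUR493.51", "carriers": ["AA"]},
  {"price": "EUR548.51", "carriers": ["AA", "BA"]}
]}
"""


def extract_prices(raw):
    """Return (currency, amount, carriers) tuples from a captured JSON body."""
    data = json.loads(raw)
    results = []
    for item in data.get("solutions", []):
        price = item["price"]
        # Assumes a three-letter currency code followed by the amount.
        currency, amount = price[:3], float(price[3:])
        results.append((currency, amount, item.get("carriers", [])))
    return results


fares = extract_prices(captured)
for currency, amount, carriers in fares:
    print(currency, amount, "/".join(carriers))
```

Once the real field names are known, only the body of extract_prices should need to change.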
If someone wants to invite me in to the group I'll be happy to share some of my experiences in scraping a dozen or so flight inventory, schedule and fare systems.
Depending on the goal of the project (which I can't quite identify from the thread) I may be able to help. I mostly use Perl, and do a lot of screen scraping for my own projects. If you go with another language I won't be of much help.
Hi edlin303, I'm sure someone will be around to invite you soon. Can't do it myself, I'm afraid. I'm a Perl person too, so that sounds good to me. However, I've just installed this: http://conkeror.org/ which has interesting possibilities. It's a programmable, fully featured browser which might make some tasks easier than writing whole scripts to do every part of the transaction. WA is already in, yes.
There are pros and cons to any data structure, and it ultimately depends on what you're trying to accomplish. Finding a ranking of cheapest travel on UA from SEA on a specific date may warrant a different design than, say, modeling how inventory pricing changes as you get closer to departure. Figure out what problem you want to solve. Plus, remember nothing is perfect (e.g. in the cheapest-travel-out-of-SEA example, I might just record destination and price. Presumably if I want to know how to get from SEA to BNA for 140, I can re-run the query later myself.) With that said, you can spend some time writing up a description of the requests and responses. Then when you have a particular problem, write some automation to send your set of requests and some method to get the relevant tidbits out of the responses.
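To make the "just record destination and price" idea concrete, here is a rough Python sketch. The Fare type and the sample numbers are illustrative assumptions (only the SEA to BNA for 140 figure comes from the post above); a different problem, such as tracking prices over time, would need a different structure.

```python
from typing import NamedTuple


class Fare(NamedTuple):
    """Deliberately minimal record: only what the ranking needs."""
    destination: str
    price: float


def cheapest_by_destination(fares):
    """Keep the lowest price seen per destination, cheapest first."""
    best = {}
    for fare in fares:
        if fare.destination not in best or fare.price < best[fare.destination]:
            best[fare.destination] = fare.price
    return sorted(best.items(), key=lambda pair: pair[1])


# Made-up sample results for one origin (SEA) and one date.
sample = [
    Fare("BNA", 140.0),
    Fare("ORD", 212.0),
    Fare("BNA", 189.0),
    Fare("DEN", 98.0),
]

ranking = cheapest_by_destination(sample)
for destination, price in ranking:
    print(destination, price)
```

The routing itself is thrown away on purpose: if a price looks interesting, the query can be re-run to see how to get it.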
My other projects have nothing to do with travel, but I doubt this would be a much different task. I have about a dozen sites I am scraping right now. I will admit I use what I would describe as a caveman programming technique: I keep throwing everything including the kitchen sink at the problem until I knock it over, then often leave everything as is until I need to do it again. I haven't had to do much collaborative programming, so I am not sure my style will mesh well with others. I have been trying to force myself to use modules for almost everything lately though, so it might be good practice. As okrogius points out, each task has its own needs. For some projects I scrape an entire site into a MySQL DB and then do post-processing from there. For others I have something I want to find, and I scrape 5-10 different sites for it and just return the results. Without knowing what we are talking about here, it's hard to speculate on what approaches would be most useful. If I see value in it for me (such as if we're talking about looking for FDs or deals from a certain city which I could use) I have both resources and bandwidth I can contribute to the cause. Edit: re-reading the OP, it sounds like we're talking about a real-time type of tool. One thing I saw mentioned was awards, so I could see value in something that can query multiple airlines for awards and return dates, mile requirements, etc. For something like that I would imagine it would be best to settle on a common language, so that anyone who knows that language can chip in some modules. For example, I could look into a module for AA awards, someone else could do DL, etc. Then with a standard set of conventions we could have it so each module accepts the same variables "username,pass,origin_city,destination_city,etc" and returns the same format output. The parent script could loop through all modules the end user selected.
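A rough sketch of that module convention, in Python purely for illustration (the thread leans toward Perl, where the same shape would work). Everything here is hypothetical: the function names, the MODULES registry, and the canned return values stand in for real scrapers that nobody has written yet.

```python
from typing import Callable, Dict, List

# Agreed convention (an assumption for this sketch): every module takes
# (username, password, origin_city, destination_city) and returns a list of
# dicts with the keys "date" and "miles".
AwardSearch = Callable[[str, str, str, str], List[dict]]


def search_aa(username, password, origin_city, destination_city):
    """Placeholder for an AA award scraper; returns made-up data."""
    return [{"date": "2011-03-01", "miles": 25000}]


def search_dl(username, password, origin_city, destination_city):
    """Placeholder for a DL award scraper; returns made-up data."""
    return [{"date": "2011-03-02", "miles": 32500}]


MODULES: Dict[str, AwardSearch] = {"AA": search_aa, "DL": search_dl}


def run_search(selected, username, password, origin_city, destination_city):
    """Parent script: loop over the modules the end user selected."""
    results = []
    for name in selected:
        for row in MODULES[name](username, password, origin_city, destination_city):
            # Tag each row with the module that produced it.
            results.append({"carrier": name, **row})
    return results


awards = run_search(["AA", "DL"], "user", "secret", "SEA", "BNA")
for award in awards:
    print(award)
```

Because every module has the same signature and output format, adding a carrier means writing one function and adding one entry to the registry.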
I'd like to request an invite to the group too. I've got a ton of interest for this stuff, and although I haven't had a lot of time lately, I do have some experience with scraping with Python/PHP.
I just saw this thread and wanted to mention that I already have a Perl/SQLite based tool (used different sources and then focused on one which ceased to work shortly before Christmas). Once I'm back from the SIN-DO I'll start hacking a new scraper. We already have a few people interested and I believe we should join forces. I would be happy to join and can show code if needed.
That sounds familiar. I just modified one of your (I presume) scripts to print out airport pairs that are in different regions, sorted by distance apart. Thought it might come in handy. Hopefully someone will be around to invite you and nomflyer in shortly.
Yaffa, one of the authors, was at the SIN DO. I'm mobile right now but I'll invite you. PM me if I haven't by tomorrow.
Here's what I've got so far, written in perl:
# Find (f)lights in next 30 days between FCO and JFK
$ ./ft -f FCO-JFK
price: EUR493.51 miles: 8740 ppm: EUR0.0564073227
carrier: AA code: OL7E6J
carrier: AA code: QL7E6J
YQ EUR240.00
price: EUR548.51 miles: 8740 ppm: EUR0.0627002289
carrier: AA code: OL7E6J
carrier: BA code: NLE2EU
YQ EUR120.00
YQ EUR120.00
...
# Tell me about (a)irport SYD
$ ./ft -a SYD
SYD,airport,Australia/Sydney,-33.946111,151.177222,Sydney Kingsford Smith; Australia (SYD)
# Tell me about airports/locations (n)ear EWR within a (r)adius of 15 miles
$ ./ft -n EWR -r 15
EWR,airport,40.7166667,-74.166667,New York Newark Liberty Int'l, NJ (EWR),0
ZRP,helipad,40.7230556,-74.160833,Newark Railway Station, NJ (ZRP),0
TEB,airport,40.85,-74.060833,Teterboro, NJ (TEB),10
CDW,airport,40.8666667,-74.283333,Caldwell Wright, NJ (CDW),12
# same in (m)etric
$ ./ft -mn EWR -r 20
EWR,airport,40.7166667,-74.166667,New York Newark Liberty Int'l, NJ (EWR),0
ZRP,helipad,40.7230556,-74.160833,Newark Railway Station, NJ (ZRP),0
TEB,airport,40.85,-74.060833,Teterboro, NJ (TEB),17
CDW,airport,40.8666667,-74.283333,Caldwell Wright, NJ (CDW),19
# (g)reat circle distance between JFK and HKG
$ ./ft -g JFK-HKG
8050
# How many BA miles and (t)ier points with a gold card on AA between JFK-HKG in J
$ ./ft -t JFK-HKG-AA-Gold-J
JFK,HKG,AA,Gold,J,18151,160
All the data is cached in a local database (except for flights at the moment), and the cache is searched first before going out to the remote site if not found.
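For anyone curious how a "near EWR within a radius" lookup could work, here is a small Python sketch using the haversine formula and the coordinates printed above. This is an assumption about the approach, not the actual ft code: the function names, the Earth radius constants, and the choice to truncate distances to whole units are all mine.

```python
import math

# Mean Earth radius; assumed constants, the real tool may use other values.
EARTH_RADIUS_MILES = 3958.8
EARTH_RADIUS_KM = 6371.0

# Coordinates copied from the ./ft output above.
AIRPORTS = {
    "EWR": (40.7166667, -74.166667),
    "ZRP": (40.7230556, -74.160833),
    "TEB": (40.85, -74.060833),
    "CDW": (40.8666667, -74.283333),
}


def great_circle(lat1, lon1, lat2, lon2, metric=False):
    """Haversine distance between two points, in miles or kilometres."""
    radius = EARTH_RADIUS_KM if metric else EARTH_RADIUS_MILES
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def nearby(code, radius, metric=False):
    """Return (code, truncated distance) for every location within radius."""
    lat, lon = AIRPORTS[code]
    hits = []
    for other, (olat, olon) in AIRPORTS.items():
        distance = great_circle(lat, lon, olat, olon, metric)
        if distance <= radius:
            hits.append((other, int(distance)))
    return sorted(hits, key=lambda hit: hit[1])


for code, distance in nearby("EWR", 15):
    print(code, distance)
```

With these four coordinates, truncating the haversine distance gives the same whole-number values as the listing above in both miles and kilometres, though that does not prove the tool computes them this way.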
I'd like to participate in writing scripts. If someone wants to invite me, I'll be happy to share some of my experiences in screen scraping.
Hi, due to the way "conversations" work here, I can't actually invite anyone myself. Hopefully someone will invite you later. What languages do you use?
+1 on sites with JSON. Farecompare's tools aren't too difficult to work out, but it is a bit of an exercise to do so. You just need a good library for parsing it and can run your own queries.
Hey all, I've been toying with rolling my own script (in Perl) to crunch the data. It's functional in hitting ITA, plus it draws on a MySQL database of all airports, airlines, and routes (imported from http://www.openflights.org/data.html). At this point, it doesn't do a whole lot and most of the options are hardcoded, so I'm pretty sure it's MILES behind what most of y'all have done over the past years. I don't know how/if I could help, but I'd love to throw my hat in the ring and maybe see what others have done. I'm a firm believer in standing on the shoulders of giants (i.e., building on other people's work), but I'm also a fan of DIY solutions to really understand the process. All of that said, I'd love to help and learn from you all, so please let me know if there's still a place for me to do that here. Thanks!
Hi elephantart, our project has taken off and we have a bunch of scripts, lots of data, a mailing list, and a few people who are collaborating. Obviously it's a lot easier to say you're interested than it is to actually do something, so at the moment you'd have to show something you've done to be involved. For me, Perl, CouchDB, WWW::Mechanize and HTML::Grabber seem to solve 95% of the things I'm interested in.
I was planning to build in a messaging system that would allow me to deploy a query in parallel to N servers to hit ITA (or whatever) without worrying about running into IP-based restrictions. But at that point, I'd probably have about 95% of the things I'm interested in solved as well. Still, it'd just be nice to compare notes and tricks; I'll let you tell me if that's part of the 5% you might still be interested in.