teh hayley

So here was my problem: there's a raw hurricane path file I need to download on a regular basis. Well, that's not the problem. The problem is that every single path prediction ends up in the same raw file. So say you get a hurricane that's lasted for 100 days... that's 100 days of raw data in one freaking file, whereas I only need the most recent data.

And this source isn't compressed. Well, I take that back. I haven't experimented with trying to tell the server that I accept gzip, so maybe I could get a smaller file that way.

Assuming the server doesn't support it though, I started wondering about doing partial HTTP downloads. And I realized that resumable downloads must work on a similar principle.

So here's what I learned. The magic words are the "Range header". You send it along with your request, and a server that supports it will let you download just a portion of a file.

Here's how you tell if the HTTP server supports it:

curl -I <full_url>
      

Then you're looking for this as a response:

Accept-Ranges: bytes
      

Or as a one-liner:

curl -sI <full_url> | grep -i range
      

-I will get you just the headers. -s suppresses the progress meter that curl prints (on stderr, not stdout) when you pipe it to another command. The -i on the grep is because I'd rather type that than Range apparently (it makes the match case insensitive, you see).
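If you'd rather have that check as something a script can branch on, here's a rough sketch. The function name supports_ranges is made up for illustration, and it only looks at whatever headers you pipe into it.

```sh
# Reads HTTP response headers on stdin and succeeds (exit 0) if the
# server advertises byte-range support via "Accept-Ranges: bytes".
# supports_ranges is a hypothetical helper name, not a curl feature.
supports_ranges() {
  grep -qiE '^accept-ranges:[[:space:]]*bytes'
}

# Usage (not run here):
#   if curl -sI "<full_url>" | supports_ranges; then
#     echo "partial downloads should work"
#   fi
```

Keep in mind the header is only an advertisement: a server that doesn't send it might still honor ranges, so treat a miss as a hint rather than proof.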

Once you know that your server supports partial downloads, you can specify a range like so:

curl -H "Range: bytes=<START>-<END>" <full_url>
      

The END is optional, so you could do something like:

curl -H "Range: bytes=1120928-" <full_url>
      

And it would start from byte 1120928 onward to the end.
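One thing worth guarding against: a server that ignores the Range header will normally just answer 200 and send the whole file, while a successful partial response comes back as 206 (Partial Content). Here's a minimal sketch that checks for that. The names range_spec and fetch_range are hypothetical helpers, and tail.dat is just a placeholder output file.

```sh
# Build the "START-END" part of a byte range. END is optional, so
# `range_spec 1120928` prints "1120928-" (from that byte to the end).
range_spec() {
  printf '%s-%s\n' "$1" "${2:-}"
}

# Hypothetical helper: fetch_range URL OUTFILE START [END]
# Saves the response body to OUTFILE and succeeds only if the server
# answered 206, i.e. it really sent just the requested bytes.
fetch_range() {
  status=$(curl -s -o "$2" -w '%{http_code}' \
    -H "Range: bytes=$(range_spec "$3" "${4:-}")" "$1") || return 1
  [ "$status" = "206" ]
}

# Usage (not run here):
#   fetch_range "<full_url>" tail.dat 1120928 || echo "no partial content"
```

If the check fails with a 200, the output file will most likely hold the entire download rather than the slice you asked for.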

So how would you know where to start? Well, one option would be to do the curl -I and look for something like:

Content-Length: 1270935
      

And then decide how much of that file you want.

Personally, I'm going to guess roughly how many bytes the most recent hurricane path takes up, subtract that guess from the total Content-Length, and then grab only from that offset to the end of the file (the most recent data is always at the end of these raw files).
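The arithmetic for that could look something like the sketch below. Both function names are made up for illustration, and the 150007-byte guess is purely an assumption, picked so the result lands on the 1120928 offset used earlier.

```sh
# Pull the numeric Content-Length value out of response headers on stdin.
# (Hypothetical helper; it takes the first Content-Length line it sees.)
content_length() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "content-length" { print $2; exit }'
}

# tail_start TOTAL GUESS: the byte offset to start from so that roughly
# the last GUESS bytes of a TOTAL-byte file get fetched. Never negative.
tail_start() {
  start=$(( $1 - $2 ))
  if [ "$start" -lt 0 ]; then
    start=0
  fi
  echo "$start"
}

# Usage (not run here):
#   total=$(curl -sI "<full_url>" | content_length)
#   curl -H "Range: bytes=$(tail_start "$total" 150007)-" "<full_url>"
```

HTTP also defines a suffix form, Range: bytes=-150007, which asks for the last 150007 bytes directly. If the server honors it, that would skip the Content-Length lookup entirely, though I haven't checked it against this particular server.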

I'm hoping to implement this in pure Ruby tomorrow, but I'm not afraid to fall back on plain *nix commands if I have to (read: if I can't figure out how to do it in Ruby).

We'll see.

