Discussion:
TIdHTTP 10, how to set user agent and use Conditional Get??
Bo Berglund
2008-06-12 21:44:52 UTC
I am writing an application to retrieve TV program data from an
internet site. The owners of the site put some requirements on how the
site data are fetched and one of them is stated in the quote below:

<quote>
All http-requests must include a User-Agent value that is unique to
this particular version of the grabbing application. The User-Agent
shall consist of an alphanumeric string that is unique for the
program, followed by "/" and an alphanumeric versionnumber.
Optionally, more information may be added with a space after the
version-number followed by an arbitrary string.
</quote>

User-Agent in Indy TIdHTTP?
---------------------------
I am using the latest version of Indy 10 (snapshot downloaded about 2
weeks ago) with Delphi 7 Pro. I am using TIdHTTP with a
TIdCompressorZLib attached as Compressor (thanks for that tip, Remy!).
I don't know how to set the user agent thing with TIdHTTP and I don't
have any success searching the Indy 10 helpfile either...

So my first question is:
How can I code the TIdHTTP component to supply the requested
User-Agent information???

Caching and conditional GET?
----------------------------
They also request that the data are only downloaded once using some
kind of caching system. For a description of this they are pointing to
a URL where fetching of RSS feeds is described:
(http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers)
However, I am not sure I understand what this site is saying and also
I don't know if it is at all possible to implement using Indy 10
components.

Basically I should be able to save the downloaded files in a cache
directory and not get them again if their dates have not changed on
the server. But then I need a mechanism to ask the server for the date
of the file and compare to my saved date for the file I have
downloaded before.

Questions:
How can one ask for the timestamp of a file on the server?
And how can one make a Conditional GET?
And is there a caching system already implemented among the many Indy
components that I can use for this particular project?


/BoB
Remy Lebeau (TeamB)
2008-06-12 22:20:11 UTC
Post by Bo Berglund
How can I code the TIdHTTP component to supply the
requested User-Agent information???
Assign the desired string value to the TIdHTTP.Request.UserAgent property.
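As a minimal sketch of that assignment, the string can be put together in the "name/version optional-text" shape the site asks for. The helper BuildUserAgent, the component variable IdHTTP1 and the name and version values are all illustrative assumptions, not anything defined by Indy or by the site:

```pascal
program UserAgentSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP;

// Hypothetical helper (not part of Indy): builds "name/version" and
// appends optional free text after a single space.
function BuildUserAgent(const AName, AVersion, AComment: string): string;
begin
  Result := AName + '/' + AVersion;
  if AComment <> '' then
    Result := Result + ' ' + AComment;
end;

var
  IdHTTP1: TIdHTTP;
begin
  IdHTTP1 := TIdHTTP.Create(nil);
  try
    // Set once before sending; the value is then used for every
    // request made through this instance.
    IdHTTP1.Request.UserAgent := BuildUserAgent('mygrabber', '1.0', '');
    Writeln(IdHTTP1.Request.UserAgent);
  finally
    IdHTTP1.Free;
  end;
end.
```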
Post by Bo Berglund
I don't know if it is at all possible to implement using
Indy 10 components.
Yes, it is.
Post by Bo Berglund
Basically I should be able to save the downloaded files in
a cache directory and not get them again if their dates have
not changed on the server.
There is slightly more involved than that.
Post by Bo Berglund
But then I need a mechanism to ask the server for the date
of the file and compare to my saved date for the file I have
downloaded before.
Read the Fishbowl article again:

"When you receive the RSS file from the webserver, check the response
header for two fields: Last-Modified and ETag. You don't have to care what
is in these headers, you just have to store them somewhere with the RSS
file.

"Next time you request the RSS file, include two headers in your
request. Your If-Modified-Since header should contain the value you snagged
from the Last-Modified header earlier. The If-None-Match header should
contain the value you snagged from the ETag header.

"There's a temptation for clients to put their own date in the
If-Modified-Since header, instead of just copying the one the server sent.
This is a bad thing, what you should be sending back is exactly the same
date the server sent you when you received the file. There's two reasons for
this. Firstly, your computer's clock is unlikely to be exactly synchronised
with the webserver, so the server could still send you files by mistake.
Secondly, if the server programmer has followed this guide (see below),
it'll only work if you send back exactly what you received."
Post by Bo Berglund
How can one ask for the timestamp of a file on the server?
Perform a HEAD request and then look at the response headers that are sent
back. But in this particular situation, you can skip the HEAD request,
since the response to a GET also carries the timestamp in its headers
whenever you download the file.
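For the case where only the headers are wanted, a HEAD request might look roughly like the sketch below. It assumes a locally created IdHTTP1 instance and uses the placeholder URL 'http://yourURLhere'; treating a zero LastModified as "header missing" is also an assumption of this sketch:

```pascal
program HeadSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP;

var
  IdHTTP1: TIdHTTP;
  I: Integer;
begin
  IdHTTP1 := TIdHTTP.Create(nil);
  try
    // HEAD asks for the response headers only, no body is transferred
    IdHTTP1.Head('http://yourURLhere');

    // Print every header line as received from the server
    for I := 0 to IdHTTP1.Response.RawHeaders.Count - 1 do
      Writeln(IdHTTP1.Response.RawHeaders[I]);

    // Indy also exposes Last-Modified as a TDateTime; a zero value is
    // assumed here to mean the server did not send the header
    if IdHTTP1.Response.LastModified > 0 then
      Writeln('Last modified: ', DateTimeToStr(IdHTTP1.Response.LastModified));
  finally
    IdHTTP1.Free;
  end;
end.
```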
Post by Bo Berglund
And how can one make a Conditional GET?
The Fishbowl article told you how. What that means in Indy terms is the
following:

var
  LastMod: TDateTime;
  ETag: String;
  Data: TMemoryStream;
begin
  if (file is in cache) then  // pseudocode: use your own cache lookup here
  begin
    // retrieve LastMod and ETag from the cache as needed ...
    IdHTTP1.Request.LastModified := LastMod;
    // the saved ETag value is sent back in the If-None-Match request header
    IdHTTP1.Request.CustomHeaders.Values['If-None-Match'] := ETag;
  end else
  begin
    IdHTTP1.Request.LastModified := 0.0;
    IdHTTP1.Request.CustomHeaders.Values['If-None-Match'] := '';
  end;

  Data := TMemoryStream.Create;
  try
    // accept both 200 (new data) and 304 (not modified) as valid replies
    IdHTTP1.Get('http://yourURLhere', Data, [200, 304]);
    if IdHTTP1.Response.ResponseCode = 200 then
    begin
      LastMod := IdHTTP1.Response.LastModified;
      ETag := IdHTTP1.Response.Headers.Values['ETag'];
      // store LastMod, ETag, and Data to cache as needed...
    end;
  finally
    Data.Free;
  end;
end;
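The article's advice to send back exactly what the server sent can also be followed literally by keeping the raw header strings instead of a parsed TDateTime. The following is only a sketch under assumptions: the file names cache.dat and cache.meta, the helper SetOrRemoveHeader and the sidecar-file layout are all hypothetical, and the URL is a placeholder.

```pascal
program ConditionalGetSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, IdHTTP, IdHeaderList;

const
  CacheFile = 'cache.dat';  // hypothetical file holding the cached body
  MetaFile = 'cache.meta';  // hypothetical file holding the two validators

// Hypothetical helper (not part of Indy): replaces a request header,
// or removes it when AValue is empty.
procedure SetOrRemoveHeader(AHeaders: TIdHeaderList; const AName, AValue: string);
var
  Idx: Integer;
begin
  Idx := AHeaders.IndexOfName(AName);
  if Idx >= 0 then
    AHeaders.Delete(Idx);
  if AValue <> '' then
    AHeaders.Values[AName] := AValue;
end;

var
  IdHTTP1: TIdHTTP;
  Meta: TStringList;
  Data: TMemoryStream;
begin
  IdHTTP1 := nil;
  Meta := nil;
  Data := nil;
  try
    IdHTTP1 := TIdHTTP.Create(nil);
    Meta := TStringList.Create;
    Data := TMemoryStream.Create;

    // Validators are only sent when a cached copy actually exists
    if FileExists(CacheFile) and FileExists(MetaFile) then
      Meta.LoadFromFile(MetaFile);

    // Send the stored strings back unchanged (empty means "no header")
    SetOrRemoveHeader(IdHTTP1.Request.CustomHeaders, 'If-Modified-Since',
      Meta.Values['Last-Modified']);
    SetOrRemoveHeader(IdHTTP1.Request.CustomHeaders, 'If-None-Match',
      Meta.Values['ETag']);

    IdHTTP1.Get('http://yourURLhere', Data, [200, 304]);

    if IdHTTP1.Response.ResponseCode = 200 then
    begin
      // New content: remember the raw validators next to the body
      Meta.Clear;
      Meta.Values['Last-Modified'] :=
        IdHTTP1.Response.RawHeaders.Values['Last-Modified'];
      Meta.Values['ETag'] := IdHTTP1.Response.RawHeaders.Values['ETag'];
      Meta.SaveToFile(MetaFile);
      Data.SaveToFile(CacheFile);
    end
    else
      Writeln('Not modified, keeping ', CacheFile);
  finally
    Data.Free;
    Meta.Free;
    IdHTTP1.Free;
  end;
end.
```

Storing the strings verbatim avoids any date reformatting on the client side, which is the concern the article raises.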
Post by Bo Berglund
And is there a caching system already implemented among the
many Indy components that I can use for this particular project?
No. You are responsible for managing that on your end.


Gambit
Bo Berglund
2008-06-12 22:44:40 UTC
On Thu, 12 Jun 2008 15:20:11 -0700, "Remy Lebeau (TeamB)" wrote:
Post by Remy Lebeau (TeamB)
Post by Bo Berglund
And is there a caching system already implemented among the
many Indy components that I can use for this particular project?
No. You are responsible for managing that on your end.
I will save the files as they are downloaded.
Post by Remy Lebeau (TeamB)
Gambit
Thanks for this piece of advice!
I will use this in my development for sure.

Another issue (a bit off-topic):
--------------------------------
I have an Apache server running on my development PC so I thought that
I should try to use that instead of the real one in order not to
disturb them. So I downloaded a few of the gz files from them and put
them on my own server. The problem I am seeing now is that my server seems
not to behave as the real server does when feeding the data to a
browser (FireFox).
Normal html and shtml files work just fine but when I try the gz files
I am offered by FireFox to save to disk or open with WinZip.
From the real server the file is actually downloaded *and expanded* by
firefox and then displayed as XML.

Do you have any idea on where I should start looking to modify my
Apache 2 webserver so that it will behave the same as the real site
for these gz files? (Just a shot in the dark, I know this is not an
Apache group, but it deals with testing anyhow...).


/BoB
Remy Lebeau (TeamB)
2008-06-12 23:26:34 UTC
Post by Bo Berglund
when I try the gz files I am offered by FireFox
to save to disk or open with WinZip.
Did you store the XML files using the .gz file extensions? If so, and if
the files are normal XML files, then you need to remove the .gz extensions.
Post by Bo Berglund
From the real server the file is actually downloaded *and
expanded* by firefox and then displayed as XML.
Only because that server is configured to detect URLs with the .gz
extensions specified and auto-compress the files while serving them to the
browser. Obviously, your server is not configured to do the same thing.
Post by Bo Berglund
Do you have any idea on where I should start looking
to modify my Apache 2 webserver so that it will behave
the same as the real site for these gz files?
I have no clue. I have never used Apache before. You will just have to
hunt through its documentation.


Gambit
Bo Berglund
2008-06-13 05:34:29 UTC
On Thu, 12 Jun 2008 16:26:34 -0700, "Remy Lebeau (TeamB)" wrote:
Post by Remy Lebeau (TeamB)
when I try the gz files I am offered by FireFox
to save to disk or open with WinZip.
Did you store the XML files using the .gz file extensions? If so, and if
the files are normal XML files, then you need to remove the .gz extensions.
What I am talking about is not my application using TIdHTTP, because I
have not had time to prepare my test environment such that it will
work. Instead I am manually downloading the gz file using FireFox
(rightclick-save as) from the original site and then I am getting the
files as truly compressed files (a 19 k file is about 2 k as gz).
So I had a few files in compressed format when I enabled Apache to
serve out the folder where they are kept and then I tested with FireFox
and got this behaviour.
So the files are actually gz and FireFox does not decompress them and
display as it does for similar URLs from the original site.
I will have to see what TIdHTTP will do once I have set up the test
environment...
Post by Remy Lebeau (TeamB)
From the real server the file is actually downloaded *and
expanded* by firefox and then displayed as XML.
Only because that server is configured to detect URLs with the .gz
extensions specified and auto-compress the files while serving them to the
browser. Obviously, your server is not configured to do the same thing.
Well, the files I am having on my system are really compressed and
with the gz extension, so there must be something else as well. Maybe
the server must somehow tell the recipient that the file is a
compressed file too?
Post by Remy Lebeau (TeamB)
Do you have any idea on where I should start looking
to modify my Apache 2 webserver so that it will behave
the same as the real site for these gz files?
I have no clue. I have never used Apache before. You will just have to
hunt through its documentation.
OK, I will have to ask in an Apache forum if I can find one...


/BoB
Remy Lebeau (TeamB)
2008-06-13 17:36:57 UTC
Post by Bo Berglund
What I am talking about is not my application using TIdHTTP
I realize that.
Post by Bo Berglund
Instead I am manually downloading the gz file using FireFox
(rightclick-save as) from the original site and then I am getting
the files as truly compressed files (a 19 k file is about 2 k as gz).
So I had a few files in compressed format when I enabled Apache
to serve out the folder where they are kept and then I tested with
FireFox and got this behaviour.
So the files are actually gz and FireFox does not decompress them
If you are getting a 2K file saved, then FireFox is saving the compressed
data, not the uncompressed data. Apache is not setting the
'Content-Encoding' response header, and is transferring the .gz files
as-is. So there is nothing telling FireFox to decompress the files.
Post by Bo Berglund
Maybe the server must somehow tell the recipient that the file
is a compressed file too?
Yes. See above.


Gambit
Bo Berglund
2008-06-13 21:14:05 UTC
On Fri, 13 Jun 2008 10:36:57 -0700, "Remy Lebeau (TeamB)" wrote:
Post by Remy Lebeau (TeamB)
Post by Bo Berglund
Instead I am manually downloading the gz file using FireFox
(rightclick-save as) from the original site and then I am getting
the files as truly compressed files (a 19 k file is about 2 k as gz).
So I had a few files in compressed format when I enabled Apache
to serve out the folder where they are kept and then I tested with
FireFox and got this behaviour.
So the files are actually gz and FireFox does not decompress them
If you are getting a 2K file saved, then FireFox is saving the compressed
data, not the uncompressed data. Apache is not setting the
'Content-Encoding' response header, and is transferring the .gz files
as-is. So there is nothing telling FireFox to decompress the files.
No, if I just enter the URL into the address field of FireFox (with
the gz ending), then FF will display the expanded xml contents.
But if I have a link to the file (I made a small html file with the
links to the files on the server) and right-click it I can select
"Save Link As..". This makes FireFox download the file to my
destination *without* expanding it. I now have the compressed file on
disk as xml.gz. I can now use 7Zip to decompress this file to the xml
file; the 2K size now becomes 15-20 k.
Post by Remy Lebeau (TeamB)
Post by Bo Berglund
Maybe the server must somehow tell the recipient that the file
is a compressed file too?
Yes. See above.
I am now discussing on the Apache NG how to configure the server to
supply the needed info such that clients like FireFox know how to
decompress the data.
So far no solution though...


/BoB
Remy Lebeau (TeamB)
2008-06-13 22:22:25 UTC
Post by Bo Berglund
No, if I just enter the URL into the address field of FireFox
(with the gz ending), then FF will display the expanded xml
contents.
That can only happen if the server is sending the 'Content-Encoding'
response header to tell FireFox to decompress the data and display the
uncompressed contents.
Post by Bo Berglund
But if I have a link to the file (I made a small html file with the
links to the files on the server) and right-click it I can select
"Save Link As..". This makes FireFox download the file to my
destination *without* expanding it.
Then the server is not sending the 'Content-Encoding' response
header, so FireFox does not know to decompress the data and will save it
as-is.


Gambit
Bo Berglund
2008-06-13 21:08:21 UTC
On Thu, 12 Jun 2008 15:20:11 -0700, "Remy Lebeau (TeamB)" wrote:
Post by Remy Lebeau (TeamB)
Post by Bo Berglund
How can I code the TIdHTTP component to supply the
requested User-Agent information???
Assign the desired string value to the TIdHTTP.Request.UserAgent property.
Done.
Post by Remy Lebeau (TeamB)
"When you receive the RSS file from the webserver, check the response
header for two fields: Last-Modified and ETag. You don't have to care what
is in these headers, you just have to store them somewhere with the RSS
file.
Don't know how to "check the headers"....
Post by Remy Lebeau (TeamB)
"Next time you request the RSS file, include two headers in your
request. Your If-Modified-Since header should contain the value you snagged
from the Last-Modified header earlier. The If-None-Match header should
contain the value you snagged from the ETag header.
Don't understand what this "If-None-Match" is all about.
Post by Remy Lebeau (TeamB)
Post by Bo Berglund
How can one ask for the timestamp of a file on the server?
Perform a HEAD request and then look at the response header that is sent
How do I "look at response headers"???
Post by Remy Lebeau (TeamB)
Post by Bo Berglund
And how can one make a Conditional GET?
The Fishbowl article told you how. What that means in Indy terms is the
var
LastMod: TDateTime;
ETag: String;
Data: TMemoryStream;
begin
if (file is in cache) then
begin
// retrieve LastMod and ETag from the cache as needed ...
OK, so I will store these values together with the cached file in my
own way.
Post by Remy Lebeau (TeamB)
IdHTTP1.Request.LastModified := LastMod;
IdHTTP1.Request.CustomHeaders.Values['If-None-Match'] := ETag;
end else
begin
IdHTTP1.Request.LastModified := 0.0;
IdHTTP1.Request.CustomHeaders.Values['If-None-Match'] := '';
end;
So the above prepares the GET request by storing the LastMod and ETag
values into the IdHTTP1 object?
Post by Remy Lebeau (TeamB)
Data := TMemoryStream.Create;
What does this do? How can I later get hold of the data from the
server? I usually use the GET method result stored into a string...
Post by Remy Lebeau (TeamB)
try
IdHTTP1.Get('http://yourURLhere', Data, [200, 304]);
Where did the 200,304 come from? What are these numbers?
Post by Remy Lebeau (TeamB)
if IdHTTP1.Response.ResponseCode = 200 then
??? 200 ???? Does it mean that the Get succeeded and has downloaded a
file?
Post by Remy Lebeau (TeamB)
begin
LastMod := IdHTTP1.Response.LastModified;
ETag := IdHTTP1.Response.Headers.Values['ETag'];
// store LastMod, ETag, and Data to cache as needed...
How do I transform Data to a file?
What is the difference between a string and TMemoryStream??
Post by Remy Lebeau (TeamB)
end;
finally
Data.Free;
end;
end;
So far I have come here (before departing on the caching system):

I have this in my constructor:
constructor TXmlTVGrabber.Create;
begin
  FHTTP := TIdHTTP.Create;
  FIdCompr := TIdCompressorZLib.Create;
  FHTTP.Compressor := FIdCompr;
  FHTTP.Request.UserAgent := 'xmltv2mei/1.0.1'; // <== New
  ...
end;

and in the download method (cut away irrelevant code):
var
  slTmp: TStringList;
  Resp: string;
  URL: string;
  dtDate: TDateTime;
  i, n: integer;
  sTmp: string;
begin
  slTmp := TStringList.Create;
  try
    URL := FBaseURL + ChannelName + '_' + FormatDateTime('yyyy-mm-dd',
      Now) + '.xml.gz';
    try
      Resp := FHTTP.Get(URL); // <== this is where I need a change?
      slTmp.Text := Resp;

So what I do is that I receive the data using the Get method into the
string Resp. Then I put it into a stringlist to process the individual
lines.


/BoB
Remy Lebeau (TeamB)
2008-06-13 22:37:04 UTC
Post by Bo Berglund
Don't know how to "check the headers"....
I already showed you a code snippet demonstrating exactly how to do that.
Go look at my earlier reply again.
Post by Bo Berglund
Don't understand what this "If-None-Match" is all about.
The very top of the Fishbowl article has this disclaimer:

"This article presumes you are familiar with the mechanics of an HTTP
query, and understand the layout of request, response, header and body."

I guess you did not read up on how the HTTP protocol actually works yet.
Post by Bo Berglund
Post by Remy Lebeau (TeamB)
if (file is in cache) then
begin
// retrieve LastMod and ETag from the cache as needed ...
OK, so I will store these values together with the cached file in my
own way.
Yes.
Post by Bo Berglund
So the above prepares the GET request by storing the LastMod
and ETag values into the IdHTTP1 object?
Yes.
Post by Bo Berglund
Post by Remy Lebeau (TeamB)
Data := TMemoryStream.Create;
What does this do?
Are you joking?
Post by Bo Berglund
How can I later get hold of the data from the server?
The TStream that is being passed to Get() will contain all of the file data
after Get() exits, provided the GET actually downloaded anything.
Post by Bo Berglund
I usually use the GET method result stored into a string...
Get() supports outputting to either a String or a TStream.
Post by Bo Berglund
Post by Remy Lebeau (TeamB)
IdHTTP1.Get('http://yourURLhere', Data, [200, 304]);
Where did the 200,304 come from? What are these numbers?
They are HTTP response codes. 200 means a request was successful and data
was returned. 304 means a conditional GET was successful but no data was
returned because the file had not been changed since the last download
(hence the use of the LastModified and ETag values). By passing the numbers
to Get() like I showed, you are telling Get() which response codes are not
errors. This is because 304 is normally treated as an error by Indy, so the
code is telling Get() that 304 is ok and not an error.
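To illustrate how the two accepted codes and the remaining error codes might be told apart, here is a hedged sketch. It assumes a locally created IdHTTP1 and the placeholder URL; the messages are illustrative only, and a real conditional GET would also set the validator headers before the call:

```pascal
program StatusCodeSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, IdHTTP;

var
  IdHTTP1: TIdHTTP;
  Data: TMemoryStream;
begin
  IdHTTP1 := nil;
  Data := nil;
  try
    IdHTTP1 := TIdHTTP.Create(nil);
    Data := TMemoryStream.Create;
    try
      // 200 and 304 are listed, so neither raises an exception here
      IdHTTP1.Get('http://yourURLhere', Data, [200, 304]);
      case IdHTTP1.Response.ResponseCode of
        200: Writeln('New data received: ', Data.Size, ' bytes');
        304: Writeln('Not modified, use the cached copy');
      end;
    except
      // Status codes outside the list (404, 500, ...) are expected to
      // surface as a protocol exception carrying the numeric code
      on E: EIdHTTPProtocolException do
        Writeln('HTTP error ', E.ErrorCode, ': ', E.Message);
    end;
  finally
    Data.Free;
    IdHTTP1.Free;
  end;
end.
```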
Post by Bo Berglund
Post by Remy Lebeau (TeamB)
if IdHTTP1.Response.ResponseCode = 200 then
??? 200 ???? Does it mean that the Get succeeded and has
downloaded a file?
Yes.
Post by Bo Berglund
How do I transform Data to a file?
TMemoryStream has a SaveToFile() method.
Post by Bo Berglund
Resp := FHTTP.Get(URL); <== this is where I need a change?
slTmp.Text := Resp;
If the server returns a 304 reply, then Resp will be empty since no data was
returned.

Since you are downloading XML, you can technically continue to use String
instead of TStream as the data storage. But if the XML is compressed on the
server and not uncompressed correctly during transfer, you would end up with
binary data in your String instead of XML. That is why I suggest using
TMemoryStream instead. Then you can look at the server's response headers
and decompress the TMemoryStream manually if needed.
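A rough sketch of the stream-based variant follows, keeping a byte-exact copy on disk and still ending up with a TStringList for line processing. It assumes the body arrives as plain XML text (for example because a Compressor is attached), and the file name download.xml is hypothetical:

```pascal
program StreamSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, IdHTTP;

var
  IdHTTP1: TIdHTTP;
  Data: TMemoryStream;
  slTmp: TStringList;
begin
  IdHTTP1 := nil;
  Data := nil;
  slTmp := nil;
  try
    IdHTTP1 := TIdHTTP.Create(nil);
    Data := TMemoryStream.Create;
    slTmp := TStringList.Create;

    IdHTTP1.Get('http://yourURLhere', Data, [200, 304]);

    if IdHTTP1.Response.ResponseCode = 200 then
    begin
      // SaveToFile writes the whole stream regardless of its position
      Data.SaveToFile('download.xml');  // hypothetical file name

      // LoadFromStream reads from the current position onward, so
      // rewind first; after the download the position is normally at
      // the end of the data
      Data.Position := 0;
      slTmp.LoadFromStream(Data);
      Writeln(slTmp.Count, ' lines received');
    end
    else
      Writeln('No new data, keep using the cached file');
  finally
    slTmp.Free;
    Data.Free;
    IdHTTP1.Free;
  end;
end.
```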


Gambit
Bo Berglund
2008-06-14 08:24:49 UTC
On Thu, 12 Jun 2008 15:20:11 -0700, "Remy Lebeau (TeamB)" wrote:
Post by Remy Lebeau (TeamB)
ETag := IdHTTP1.Response.Headers.Values['ETag'];
Won't go through the compiler. Complains that there is no such thing
as Response.Headers...
Did you mean Response.CustomHeaders????


/BoB
Remy Lebeau (TeamB)
2008-06-16 17:44:33 UTC
Post by Bo Berglund
Won't go through the compiler. Complains that there is no
such thing as Response.Headers...
Use RawHeaders then:

ETag := IdHTTP1.Response.RawHeaders.Values['ETag'];
Post by Bo Berglund
Did you mean Response.CustomHeaders????
No. CustomHeaders only applies to requests, not responses.


Gambit
Bo Berglund
2008-06-18 20:49:45 UTC
On Mon, 16 Jun 2008 10:44:33 -0700, "Remy Lebeau (TeamB)" wrote:
Post by Remy Lebeau (TeamB)
ETag := IdHTTP1.Response.RawHeaders.Values['ETag'];
Worked fine! :-)


I have been advised by the people running the site with the data to
use something called "Persistent connections" in order to speed up the
transfers.
But I don't find a property for TIdHTTP called this so I guess that it
hides somewhere else (if it exists).
Do you know what he meant?
(The system works now albeit a bit slower than I want it to work)


/BoB
Remy Lebeau (TeamB)
2008-06-19 17:29:49 UTC
Post by Bo Berglund
I have been advised by the people running the site with the data
to use something called "Persistent connections" in order to
speed up the transfers.
Set the TIdHTTP.ProtocolVersion property to pv1_1, or set the
TIdHTTP.Request.Connection property to 'keep-alive'. HTTP 1.1 uses
persistent connections by default, whereas HTTP 1.0 does not.
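A minimal sketch of those settings follows. It also reuses a single TIdHTTP instance for all downloads, on the assumption that a persistent connection can only help when the same object makes the successive requests. The two URLs are hypothetical placeholders on one host:

```pascal
program KeepAliveSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, IdHTTP;

const
  // Hypothetical placeholder URLs, both on the same server
  URLs: array[0..1] of string =
    ('http://yourURLhere/a', 'http://yourURLhere/b');

var
  IdHTTP1: TIdHTTP;
  Data: TMemoryStream;
  I: Integer;
begin
  IdHTTP1 := nil;
  Data := nil;
  try
    IdHTTP1 := TIdHTTP.Create(nil);
    Data := TMemoryStream.Create;

    // Ask for HTTP 1.1 and state the keep-alive wish explicitly
    IdHTTP1.ProtocolVersion := pv1_1;
    IdHTTP1.Request.Connection := 'keep-alive';

    // One instance for every request, so an open connection can be
    // reused if the server keeps it alive
    for I := Low(URLs) to High(URLs) do
    begin
      Data.Clear;
      IdHTTP1.Get(URLs[I], Data);
      Writeln(URLs[I], ': ', Data.Size, ' bytes');
    end;
  finally
    Data.Free;
    IdHTTP1.Free;
  end;
end.
```

Whether the connection really stays open still depends on the server's own keep-alive settings.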


Gambit
