Have you ever wondered, while signing into Google Talk, what might be happening under the hood? How can it tell you that your friend is typing the moment she starts? How does it manage to show all that real-time presence information?

Well, one day I got really curious and decided to open it up! In this two-part article, I share my adventures as I unravel the way Google Talk does what it is best at: communication.

The basics

First of all, we must understand that Google Talk, like any such communications client, has to be a socket program at its core. A socket program is a networking program that usually targets a specific protocol. TCP/IP is the most widely used and supported protocol suite on the internet. Most of the protocols we come across in our daily lives, such as HTTP, FTP and SMTP, are built on top of TCP/IP.
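To make "socket program" a little more concrete, here is a minimal sketch in Python. It is not taken from Google Talk; it only shows the bare pattern every such client follows: open a TCP connection, send bytes, read bytes back. The helper names run_echo_server and echo_once are my own, and the server runs on the loopback address so nothing leaves the machine.

```python
import socket
import threading


def run_echo_server(server_sock: socket.socket) -> None:
    """Accept one connection and send back whatever bytes arrive."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def echo_once(message: bytes) -> bytes:
    """Start a loopback server, send one message, return the echoed reply."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]

    worker = threading.Thread(target=run_echo_server, args=(server_sock,))
    worker.start()

    # The client side: this is the part a chat client would resemble.
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(message)
        reply = client.recv(1024)

    worker.join()
    server_sock.close()
    return reply
```

A protocol such as HTTP or SMTP is then just an agreement about which bytes to send over a connection like this one, and in what order.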

Identifying the protocol Google Talk uses

Our next task is to find out which protocol Google Talk actually uses. There are two ways to do this. The first one is to simply query Google itself for an answer. The second one is a little more fun. I shall stick to the second one.

There are a number of network tracing and analysis tools available on the internet. Some of these are very powerful and are capable of revealing a lot about protocols, TCP/IP packets and so on. The one I have chosen is called Ethereal. It is a neat tool that can analyze live TCP/IP packets. It is also free, open source software.

After having set up Ethereal properly, I open the GTalk client and sign in while live capture of TCP/IP packets is in progress. Here's what I get.

Screenshot of Ethereal protocol analyzer live trace showing Jabber packets. Observe rows numbered 120 and 122.

It's not hard to tell that Google Talk always connects to its server, talk.google.com. When we ping talk.google.com, we see its IP address, which is 209.85.137.125 from my computer as of today. Looking back at the trace above, we find rows whose "Destination" column has the value 209.85.137.125 (observe rows numbered 120 and 124). The "Source" column for these rows reads 192.168.200.190, which happens to be my computer's IP address! We can therefore conclude that these rows are the TCP/IP packets sent from my computer to the Google Talk server, talk.google.com!

Similarly, rows such as 122, 128 and 131 are packets that talk.google.com sent to my computer in response to those requests. Now that we've identified the important packets, it's a simple matter of reading the value under the "Protocol" column, which says Jabber.
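The reasoning used to read the trace can be written down as a small Python sketch. The two IP addresses and the row numbers are the ones quoted in the text; the function name direction and the TRACE_ROWS list are my own illustrative choices, not anything Ethereal exports.

```python
CLIENT_IP = "192.168.200.190"  # my computer, as seen in the trace
SERVER_IP = "209.85.137.125"   # talk.google.com, as resolved at the time


def direction(source: str, destination: str) -> str:
    """Classify one captured row by comparing its Source and Destination."""
    if source == CLIENT_IP and destination == SERVER_IP:
        return "client -> server"
    if source == SERVER_IP and destination == CLIENT_IP:
        return "server -> client"
    return "unrelated"


# (row number, Source, Destination) for the rows mentioned in the text.
TRACE_ROWS = [
    (120, CLIENT_IP, SERVER_IP),
    (122, SERVER_IP, CLIENT_IP),
    (124, CLIENT_IP, SERVER_IP),
    (128, SERVER_IP, CLIENT_IP),
    (131, SERVER_IP, CLIENT_IP),
]

for row, source, destination in TRACE_ROWS:
    print(row, direction(source, destination))
```

Any row whose addresses match neither pairing belongs to some other conversation on the network and can be ignored for our purposes.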

More about Jabber (and XMPP)

Quoting from the home of Jabber, www.jabber.org

Jabber is best known as "the Linux of instant messaging" an open, secure, ad-free alternative to consumer IM services like AIM, ICQ, MSN, and Yahoo. Under the hood, Jabber is a set of streaming XML protocols and technologies that enable any two entities on the Internet to exchange messages, presence, and other structured information in close to real time.

Jabber defines a host of sub-protocols, XMPP being the core of them. XMPP stands for eXtensible Messaging and Presence Protocol. In October 2004, it was adopted by the IETF community and the specification is available as RFC 3920.

Covering XMPP specifics is outside the scope of this article. In summary, XMPP is an XML-based communications protocol. This means all requests and responses are exchanged as XML. The Google Talk client sends requests as XML messages to its server at talk.google.com and receives responses as XML messages too. In the next section, we shall see what forms the essence of XMPP while we try out some experiments.

Raw XMPP communication with talk.google.com

How about talking to the Google Talk server in its native language? Well, this is not as far-fetched as it sounds. But before we attempt anything like that, we should understand the nature of the XMPP protocol.

XMPP defines two fundamental terms with respect to messaging — Streams and Stanzas. Here's how we can define them:

Stream: A Stream is an open XML envelope that one entity sends before exchanging further XML elements with another. The entities can be clients or servers. The XML elements exchanged inside it are known as stanzas, as we learn in the next definition. A Stream is always the root element of its XML document. It starts with an optional XML declaration (prolog) followed by an unterminated <stream:stream> opening tag. The Stream carries other information such as the server it is addressed to, the version of the protocol used and various namespace declarations.

Stanza: A Stanza is a specific, well-formed and complete XML element which either of the entities sends within an already open XML Stream. Stanzas are always first-level children of the Stream's root element. XMPP Core defines three types of Stanzas: <presence/>, <message/> and <iq/>.

Entities can send any number of these Stanzas within an open Stream. All other information is sent as nested elements or attributes of these core Stanzas. Further details of these Stanzas are again beyond the scope of this article.
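The difference between the two definitions shows up clearly when we try to generate them. The following Python sketch is only an illustration under my own assumptions: the helper names stream_header and message_stanza are made up, and the addresses in any call are placeholders. The Stream header has to be written as text because it is deliberately left unterminated, whereas a Stanza is a complete element and can be built with an ordinary XML library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def stream_header(to_domain: str) -> str:
    """Return the opening of a Stream: a prolog plus an unclosed root tag."""
    return (
        '<?xml version="1.0"?>'
        "<stream:stream"
        f" to={quoteattr(to_domain)}"
        ' xmlns="jabber:client"'
        ' xmlns:stream="http://etherx.jabber.org/streams"'
        ' version="1.0">'  # note: no closing tag, the Stream stays open
    )


def message_stanza(sender: str, recipient: str, body: str) -> str:
    """Return a complete, well-formed <message/> Stanza as text."""
    message = ET.Element("message", {"from": sender, "to": recipient})
    ET.SubElement(message, "body").text = body
    return ET.tostring(message, encoding="unicode")


print(stream_header("example.com"))
print(message_stanza("alice@example.com", "bob@example.com", "Hello!"))
```

Building the Stanza through ElementTree rather than by string concatenation means special characters in the body, such as "<" or "&", are escaped for us.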

Now let's see what happens in a simple session between the client and the server. The following shows a typical interaction between the two. This and many more such examples can be found in the specification of XMPP Core, RFC 3920.

Client

<?xml version="1.0"?>
<stream:stream 
    to="example.com" 
    xmlns="jabber:client" 
    xmlns:stream="http://etherx.jabber.org/streams" 
    version="1.0">

Server

<?xml version="1.0"?>
<stream:stream 
    from="example.com" 
    id="someid" 
    xmlns="jabber:client" 
    xmlns:stream="http://etherx.jabber.org/streams" 
    version="1.0">

... encryption, authentication, and resource binding ...

Client

<message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
    <body>Art thou not Romeo, and a Montague?</body>
</message>

Server

<message from="romeo@example.net" to="juliet@example.com" xml:lang="en">
    <body>Neither, fair saint, if either thee dislike.</body>
</message>

Client

</stream:stream>

Server

</stream:stream>

Typical interaction.

By looking at the pattern of messages, we can tell that there are two separate XML documents involved here: one that the client opens and eventually closes, and another that the server opens and closes. However, during an interaction, fragments of the two documents are interspersed as the two sides take turns.
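Because each side's document stays open for the whole session, a receiver cannot wait for the closing tag before it starts parsing. One way to handle this, sketched below in Python, is an incremental parser that is fed the text piece by piece. This is my own illustration using the standard library's XMLPullParser, not how Google Talk is implemented; the function name describe_stream is made up, and SERVER_CHUNKS simply replays the server's half of the exchange shown above.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def describe_stream(chunks):
    """Feed XML text piece by piece and report Stream and Stanza boundaries."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    seen = []

    def drain():
        nonlocal depth
        for event, element in parser.read_events():
            if event == "start":
                depth += 1
                if depth == 1:  # the root element: the Stream itself
                    seen.append("stream opened")
            else:
                if depth == 2:  # a first-level child: a complete Stanza
                    seen.append("stanza: " + local_name(element.tag))
                elif depth == 1:
                    seen.append("stream closed")
                depth -= 1

    for chunk in chunks:
        parser.feed(chunk)
        drain()
    parser.close()  # make sure nothing is left buffered in the parser
    drain()
    return seen


SERVER_CHUNKS = [
    '<?xml version="1.0"?>'
    '<stream:stream from="example.com" id="someid" xmlns="jabber:client" '
    'xmlns:stream="http://etherx.jabber.org/streams" version="1.0">',
    '<message from="romeo@example.net" to="juliet@example.com" xml:lang="en">'
    "<body>Neither, fair saint, if either thee dislike.</body></message>",
    "</stream:stream>",
]

print(describe_stream(SERVER_CHUNKS))
```

A real client would run one such parser for the server's document while writing its own document in the opposite direction.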

Now let's try some talking with talk.google.com. XMPP will be our language and TCP/IP our medium.

To communicate with any server in its native protocol, we need to be able to open a terminal session to a specific port on the server. We can use any of the available telnet clients, such as Microsoft Telnet or PuTTY, to do this. I chose PuTTY.

Let's first configure PuTTY to open a raw connection to talk.google.com on port 5222. Note that 5222 is the non-SSL port Jabber uses. If we were to use 5223, the SSL-enabled port, we would have difficulty communicating by hand because the traffic is encrypted.
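For readers who would rather script this step than click through PuTTY, the same raw connection can be sketched with Python's socket module. This is an assumption-laden illustration: the function name open_stream is my own, it reads only the first chunk of the reply, and the server named in this article may no longer answer on this port today.

```python
import socket


def open_stream(host: str, port: int, domain: str, timeout: float = 5.0) -> str:
    """Connect over plain TCP, send a Stream header, return the first reply."""
    header = (
        '<?xml version="1.0"?>'
        f'<stream:stream to="{domain}" '
        'xmlns="jabber:client" '
        'xmlns:stream="http://etherx.jabber.org/streams" '
        'version="1.0">'
    )
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(header.encode("utf-8"))
        # One recv() is enough for a first look; a real client keeps reading.
        return conn.recv(4096).decode("utf-8", errors="replace")
```

Calling open_stream("talk.google.com", 5222, "gmail.com") would mirror the PuTTY session described here, with the returned text being whatever the server sends first.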

Screenshot of PuTTY showing configuration to talk.google.com on port 5222

Here's another screenshot of the actual raw XMPP communication we've been talking about till now. The first and third XML fragments are sent by the client (us) while the second and fourth are sent by the server, talk.google.com.

Screenshot of PuTTY showing raw XMPP interaction with talk.google.com.

We first initiate the stream with the "to" attribute of the Stream set to "gmail.com". The server acknowledges the request by opening a Stream of its own, listing the features it supports and the encryption method it wants used. The <starttls xmlns="urn:ietf:params:xml:ns:xmpp-tls"><required/></starttls> element indicates that the server requires TLS. The client acknowledges by sending another fragment, the <starttls xmlns="urn:ietf:params:xml:ns:xmpp-tls"/> element, indicating that it agrees to start a TLS negotiation.
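A client has to make the same decision programmatically: look at the advertised features and work out whether TLS is demanded. The Python sketch below shows one way to do that. The function name tls_requirement is my own, and SAMPLE_FEATURES is a hand-written fragment that wraps the starttls element quoted above in the stream:features element from RFC 3920, not a capture of what talk.google.com actually sent.

```python
import xml.etree.ElementTree as ET

TLS_NS = "urn:ietf:params:xml:ns:xmpp-tls"

# What the client sends back when it agrees to negotiate TLS.
STARTTLS_REQUEST = '<starttls xmlns="urn:ietf:params:xml:ns:xmpp-tls"/>'

# Hand-written sample for illustration only.
SAMPLE_FEATURES = (
    '<stream:features xmlns:stream="http://etherx.jabber.org/streams">'
    '<starttls xmlns="urn:ietf:params:xml:ns:xmpp-tls"><required/></starttls>'
    "</stream:features>"
)


def tls_requirement(features_xml: str) -> str:
    """Return 'required', 'optional' or 'not offered' for a features element."""
    features = ET.fromstring(features_xml)
    starttls = features.find(f"{{{TLS_NS}}}starttls")
    if starttls is None:
        return "not offered"
    if starttls.find(f"{{{TLS_NS}}}required") is not None:
        return "required"
    return "optional"


if tls_requirement(SAMPLE_FEATURES) == "required":
    print("client should send:", STARTTLS_REQUEST)
```

Note that the sample declares the stream prefix itself so it can be parsed on its own; inside a live Stream that declaration is inherited from the root element.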

After this line, the server again acknowledges by telling the client to proceed with TLS negotiation. This is followed by a SASL negotiation. Here again, we reach the boundary of our scope.

After the authentication phase, the client and the server can start exchanging XML Stanzas. We, however, can't reach this stage using the raw communication approach with the Google Talk server: the TLS negotiation and the SASL handshake both involve complex cryptographic mechanisms, and the messages exchanged would no longer be human readable.

What next?

So far, we have talked about identifying the protocols that applications use and discovered that Google Talk is just another Jabber client. We learnt the basics of XMPP. We even managed a preliminary raw XMPP exchange with talk.google.com. Next, we shall advance a step further and see how we can exploit the wealth of features provided by XMPP to play with Google Talk!

To know more, read Fun with XMPP and Google Talk, Part 2.

Acknowledgements

They say, "The Early Bird Catches The Worm". Had it not been for friends who pointed out mistakes and helped me fix them soon after it was published, this article would still have been in bad shape. My sincere thanks to my friends Amod Pandey, Bharati K and Hemanth H M.
