Sunday, February 12, 2012

My First Look at Google Research

There was another article in the NYT today about the growing importance of quant jocks in analyzing the blizzard of data the cloud is generating. Apparently data analyst is now the hot new job title out there. Makes me feel good that I'm going back to my quant roots.

For a quant, it's all about the data. Given the cost of getting new data, any existing stores of data I can farm give me a jump start on refining my thesis. Tonight I am looking at Google Research to see what I can find. So far I have not found references to data, but in a slide presentation (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/stanford-295-talk.pdf) I see a reference to the system qualities they value:
– Simplicity
– Scalability
– Performance
– Reliability
– Generality
– Features

This is as good a list as any of the attributes people value apart from the functional aspect of a product (although I must confess I'd be guessing at what they mean by Generality, and of course Features is just another word for functionality, IMHO).

This list of latency numbers looks like something that could be helpful...


  • L1 cache reference 0.5 ns
  • Branch mispredict 5 ns
  • L2 cache reference 7 ns
  • Mutex lock/unlock 100 ns
  • Main memory reference 100 ns
  • Compress 1K bytes with Zippy 10,000 ns
  • Send 2K bytes over 1 Gbps network 20,000 ns
  • Read 1 MB sequentially from memory 250,000 ns
  • Round trip within same datacenter 500,000 ns
  • Disk seek 10,000,000 ns
  • Read 1 MB sequentially from network 10,000,000 ns
  • Read 1 MB sequentially from disk 30,000,000 ns
  • Send packet CA->Netherlands->CA 150,000,000 ns
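
To get a feel for the gaps between these numbers, here is a minimal Python sketch that puts the list into a table and computes a few ratios. The figures are copied from the list above; the helper names `ratio` and `humanize` are my own, made up for illustration.

```python
# Latency figures in nanoseconds, copied from the list above.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Branch mispredict": 5,
    "L2 cache reference": 7,
    "Mutex lock/unlock": 100,
    "Main memory reference": 100,
    "Compress 1K bytes with Zippy": 10_000,
    "Send 2K bytes over 1 Gbps network": 20_000,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Disk seek": 10_000_000,
    "Read 1 MB sequentially from network": 10_000_000,
    "Read 1 MB sequentially from disk": 30_000_000,
    "Send packet CA->Netherlands->CA": 150_000_000,
}


def ratio(slow, fast):
    """How many times longer `slow` takes than `fast` (hypothetical helper)."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


def humanize(ns):
    """Render a nanosecond count in the largest unit that fits (ms, us, or ns)."""
    for unit, size in (("ms", 1_000_000), ("us", 1_000)):
        if ns >= size:
            return f"{ns / size:g} {unit}"
    return f"{ns:g} ns"


if __name__ == "__main__":
    for name, ns in LATENCY_NS.items():
        print(f"{name:40s} {humanize(ns):>10s}")
    print(ratio("Disk seek", "Main memory reference"))
    print(ratio("Read 1 MB sequentially from disk",
                "Read 1 MB sequentially from memory"))
```

On these figures, one disk seek costs as much as 100,000 main memory references, and reading 1 MB sequentially from disk takes 120 times as long as reading it from memory.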

I find these slides interesting...

Source Code Philosophy
• Google has one large shared source base
– lots of lower-level libraries used by almost everything
– higher-level app or domain-specific libraries
– application specific code
• Many benefits:
– improvements in core libraries benefit everyone
– easy to reuse code that someone else has written in another context
• Drawbacks:
– reuse sometimes leads to tangled dependencies
• Essential to be able to easily search whole source base
– gsearch: internal tool for fast searching of source code
– huge productivity boost: easy to find uses, defs, examples, etc.
– makes large-scale refactoring or renaming easier


Software Engineering Hygiene
• Code reviews
• Design reviews
• Lots of testing
– unittests for individual modules
– larger tests for whole systems
– continuous testing system
• Most development done in C++, Java, & Python
– C++: performance critical systems (e.g. everything for a web query)
– Java: lower volume apps (advertising front end, parts of gmail, etc.)
– Python: configuration tools, etc.

Multi-Site Software Engineering
• Google has moved from one to a handful to 20+ engineering sites around the world in last few years
• Motivation:
– hire best candidates, regardless of their geographic location
• Issues:
– more coordination needed
– communication somewhat harder (no hallway conversations, time zone issues)
– establishing trust between remote teams important
• Techniques:
– online documentation, e-mail, video conferencing, careful choice of interfaces/project decomposition
– BigTable: split across three sites



Something else I found at Google Research was Google Correlate. Type in a term and it will find other terms whose search patterns match. You can correlate by time series or by geography. Kinda cool...

  • The term "thesis" has a clear seasonal pattern, peaking in fall and spring, and closely matches the term "factors" (United States Web Search activity for thesis and factors, r=0.9628)
  • The term "data analyst" has been trending up since 2008 after having been level in the period 2004 to 2008. It correlates with these other terms with r ranging from 0.9390 to 0.9218:
    • pain management
    • biotin
    • ignore
    • hiring manager
    • coordinator salary
    • spondylosis
    • psychiatric nurse
    • how to answer
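
The r values above are correlation coefficients between search-activity series. I don't know how Google Correlate computes its matches internally, so the following is only a minimal sketch of the plain Pearson correlation between two equal-length series, plus a hypothetical helper (`rank_by_correlation`, my own name) that orders candidate terms by how well they match a target. Any data you feed it is your own; none of it comes from Google.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(target, candidates):
    """Sort candidate series (a dict of name -> values) by r against target.

    Hypothetical helper for illustration: highest correlation first.
    """
    scored = [(name, pearson_r(target, series))
              for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A high r only says that two series move together. As the list above shows ("data analyst" alongside "biotin" and "spondylosis"), that is no evidence that the terms have anything to do with each other.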
 
