Database Architecture

What database architecture do Google and Yahoo use?

Find out about Big Data and Big Table.

Discuss the main reasons for choosing such an architecture.

8 thoughts on “Database Architecture”

  1. mykh74

    Both Google and Yahoo use NoSQL database architectures for their search engines. The NoSQL movement is often traced to Google’s BigTable database system. BigTable was later joined by newer cousins such as Hadoop, Cassandra, and MongoDB.

    The main reason Google, Yahoo, Amazon, and other large Web companies do not rely on a relational DBMS for this work is that a traditional database is slow at processing very large amounts of data. Google was among the first companies to create an entirely new database architecture for processing large amounts of data: BigTable. The name reflects the fact that data are stored in very large tables, which are automatically distributed across the available servers, potentially thousands of computers. BigTable scales to far higher loads than a single-server RDBMS. Its structure is a multidimensional, distributed, sorted map, indexed by row, column, and timestamp. The timestamp makes it possible to select cells by version or to delete old cell versions, for example old page contents.

    To process the information in BigTable in bulk, Google uses MapReduce, a framework for processing data sets stored across a large number of distributed machines. MapReduce sends the calculation logic to each chunk of the data on each server, so the work runs in parallel close to where the data is stored. “Map” means that the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. “Reduce” means that the responses from the worker nodes are collected and combined to form the output. Because of its structure, BigTable does not support rich SQL queries.
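
    The map and reduce steps described above can be sketched in a few lines of Python. This is only a single-process illustration, not Google’s implementation: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample pages are made up for the example, and a real cluster would run the map calls on many workers in parallel.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)

def word_count(documents):
    # On a real cluster each document (or chunk) would be mapped on a different worker.
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())

# Hypothetical input: two tiny "pages".
docs = {"page1": "big table big data", "page2": "big data tools"}
counts = word_count(docs)
print(counts)
```

    The same split into a per-record map function and a per-key reduce function is what lets the framework spread the work over many machines.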

    Yahoo uses Apache Hadoop, an open source framework written in Java that is similar to Google’s architecture. The core of the framework was modeled on Google’s MapReduce and Google File System designs. According to Wikipedia, the Yahoo Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores.

    Big Data refers to collections of data sets so large that they are difficult to process using traditional tools and applications. The concept becomes more important as the amount of data stored around the world grows. New tools for traditional databases and new database engines such as NoSQL systems are being developed to make it possible to process Big Data sets.

  2. rahul

    Google seems to use three main databases for its different applications: Bigtable, Megastore, and Spanner.
    Bigtable is basically a NoSQL database consisting of rows and columns, with multiple versions of each cell distinguished by timestamp. Replication between Bigtable clusters is eventually consistent rather than immediately consistent, which is enough for applications like Google Earth and personalized search. It also provides features like compression and commit logs, which help with recovery after failures, and its tables are split into tablets that are distributed across multiple tablet servers.
    Megastore is a storage system built on Bigtable that combines the scalability of a NoSQL store with relational database features, and it provides strong consistency and availability guarantees. It offers ACID semantics within fine-grained partitions of data, supports SQL-like semantics, and is used for applications like Gmail, Calendar, and Picasa. It is optimized for reads; write throughput is comparatively poor.
    Spanner is the latest database developed by Google. It is described as a globally distributed database. It assigns globally meaningful timestamps to transactions and uses them to keep multiple versions of data, as Bigtable does. It also supports a relational data model similar to Megastore’s. Spanner aims to avoid the consistency issues that arise when Bigtable data is replicated across data centers, while still keeping replicas close to the users who read them. Google uses it behind F1, the database for its advertising backend.

    Yahoo uses a database system called PNUTS, which presents a simplified relational data model and is designed for low read latency, high availability, fault tolerance, scalability, and relaxed consistency guarantees. It supports both hash-based and ordered tables for storing data, and the choice depends on the application. Example uses include user databases and social applications.

    Big Data describes huge sets of data, which can be structured or unstructured. Big Table is a system designed by Google to handle Big Data at petabyte scale. Data from scientific research, such as the Human Genome Project and the Large Hadron Collider, are examples of big data.

  3. Vinay Venkatachala

    Google GFS – Google uses Google File System (GFS). It is a distributed file system developed by Google. Key ideas on which this has been developed are –
    • Fault tolerance
    • Ability to run on inexpensive commodity hardware
    • High aggregate performance when there are a large number of clients
    This system was developed to support Google’s application workloads and technological environment. A GFS cluster consists of one master and multiple chunkservers and can be accessed by multiple clients. In this system, files are divided into fixed-size chunks. Each chunk is assigned a 64-bit chunk handle by the master when it is created. This handle serves as a unique identifier. For reliability, every chunk is replicated on more than one server (the default is 3 servers). The master communicates with the chunkservers through HeartBeat messages to give them instructions and to collect their state.
    Yahoo HDFS – Yahoo uses the Hadoop Distributed File System (HDFS). It is a distributed file system designed specifically to store large amounts of data and provide high-throughput access to this data. As in GFS, data files are stored on multiple servers, and this redundancy provides more reliability. The design of HDFS is based on GFS. Each file is broken down into blocks of fixed size, and these blocks are stored across a cluster of one or more machines. These machines are referred to as DataNodes. The default block size in early versions of HDFS is 64 MB (later releases default to 128 MB). As in GFS, the blocks are replicated on 3 nodes by default.
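
    To make the block and replica numbers above concrete, here is a rough sketch of how a file could be cut into fixed-size blocks and each block placed on three machines. The round-robin placement and the node names (dn1 to dn5) are assumptions for illustration only; real GFS and HDFS placement also takes racks, disk usage, and load into account.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size quoted above
REPLICATION = 3                 # default replica count quoted above

def num_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of fixed-size blocks needed for a file (the last one may be partial)."""
    return (file_size_bytes + block_size - 1) // block_size

def place_replicas(n_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(n_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement

one_gb = 1024 ** 3
blocks = num_blocks(one_gb)
layout = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4", "dn5"])

print(blocks, "blocks,", blocks * REPLICATION, "block replicas in total")
print("block 0 lives on", layout[0])
```

    Losing any single machine in this layout still leaves two copies of every block, which is the reliability argument made above.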

    BigData – BigData refers to data sets that are too large to be processed well by traditional database systems. Traditional database management systems take a huge amount of time on data of this size, and often they cannot handle it at all. As more data-centric applications evolved that capture data, derive information by analyzing it, and produce new data from that analysis, the amount of data grew tremendously, leading to new terms and technologies. Google’s MapReduce framework is one of the key innovations for handling BigData.
    BigTable – It is a distributed storage system for handling data that is expected to scale to very large sizes. Google uses it extensively to store data. Some of its key features are –
    • Wide applicability
    • Scalability
    • High availability
    • High performance
    It is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp. Each value in the map is an array of bytes. BigTable stores its data in GFS, described above. A BigTable cluster operates in a shared pool of machines that also run other applications.
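
    The (row key, column key, timestamp) → bytes map can be modelled with a small in-memory class. This is a toy sketch, not BigTable’s API: the class name TinyTable, its methods, and the sample row and column keys are all invented for the example, and nothing here is distributed or persistent.

```python
class TinyTable:
    """Toy model of a sparse map: (row key, column key, timestamp) -> bytes."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value: bytes):
        # Only cells that are actually written take up space (sparseness).
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (latest if None)."""
        versions = self._cells.get((row, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def delete_versions_before(self, row, column, cutoff):
        """Drop old versions of a cell, e.g. outdated page contents."""
        versions = self._cells.get((row, column), {})
        for t in [t for t in versions if t < cutoff]:
            del versions[t]

table = TinyTable()
table.put("com.example.www", "contents:", 1, b"<html>v1</html>")
table.put("com.example.www", "contents:", 2, b"<html>v2</html>")

print(table.get("com.example.www", "contents:"))      # latest version
print(table.get("com.example.www", "contents:", 1))   # version as of timestamp 1
print(table.get("com.example.www", "anchor:x"))       # never written
```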

    Key reasons for choosing this architecture-
    1. Ability to handle huge amount of data
    2. High availability
    3. Scalability as the data size scales up
    4. High throughput when processing large amount of data
    5. Providing reliability by increasing redundancy
    6. Derive useful information from large data in quick time
    7. Efficiency when using commodity hardware

  4. manishekar

    Google uses BigTable database architecture.
    Yahoo uses PNUTS (Platform for Nimble Universal Table Storage) database architecture.

    BigTable is a database architecture for managing huge amounts of data. It is a distributed storage system in which data is organized as tables with rows and columns. Unlike a traditional relational database, BigTable is a sparse multidimensional map. Each cell of BigTable carries a timestamp, which makes it possible to keep multiple versions of a cell. BigTable is built upon the Google File System. Data is kept sorted by row key, and the table map is indexed by row key, column key, and timestamp. A table consists of many tablets, each holding a limited range of rows. Tablets are the unit for distributing data across different servers. BigTable provides a flexible, high-performance solution for the many Google products that use it.

    PNUTS is a distributed database system used mainly in Yahoo’s web applications. It provides competitive latency and throughput. Data is partitioned and replicated over a number of servers. The architecture gives easy access to the underlying data while hiding the internals of the database. Data is organized as ordered or hashed tables. PNUTS helps solve the problem of scaling a system across multiple data centers. It is suitable for a large number of concurrent requests, including updates and queries, and it supports automatic load balancing.
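
    The difference between hashed and ordered tables comes down to how a key is mapped to a server. The sketch below is not PNUTS code; the server names, the split points, and both helper functions are assumptions chosen only to show the two strategies side by side.

```python
import bisect
import hashlib

SERVERS = ["s0", "s1", "s2"]

def hashed_server(key, servers=SERVERS):
    """Hash partitioning: keys spread evenly, but neighbouring keys scatter."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

# Ordered partitioning: each server owns one contiguous key range.
# Hypothetical split points: s0 holds keys < "h", s1 holds "h".."p", s2 the rest.
BOUNDARIES = ["h", "p"]

def ordered_server(key, servers=SERVERS, boundaries=BOUNDARIES):
    """Range partitioning: keys stay sorted, so a range scan touches few servers."""
    return servers[bisect.bisect_right(boundaries, key)]

users = ["alice", "bob", "henry", "mallory", "zoe"]
print({u: hashed_server(u) for u in users})
print({u: ordered_server(u) for u in users})
```

    Hashing suits point lookups such as fetching one user record, while ordered tables suit queries over a range of keys.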

    Big Data is a collection of data sets that are extremely large compared to traditional databases. These data sets cannot be handled by conventional software. In Big Data, the data is gathered from many sources, such as sensors, visual perception systems, mobile devices, and the environment. Big Data applications are thus fed by a wide array of unstructured and semi-structured data sources, including the Internet, PCs, smartphones, tablets, machines, multimedia applications, and a variety of other connected devices. To handle this information, the required software may need to run on thousands of servers. Processing this information provides many business opportunities. Big Data is all about seeing and understanding the relations within and among pieces of information.

  5. ramya

    Database architecture used by Yahoo and Google.
    There are four popular types of database systems currently used by top corporations like Yahoo, Facebook, Google, Microsoft, and Amazon:
    • Massively Parallel Processing (MPP) or parallel DBMS
    • Column-oriented database
    • Streaming processing (ESP or CEP)
    • Key-value storage (with MapReduce programming model)
    For managing large-scale data, key-value storage (with the MapReduce programming model), the approach taken by Big Table, is used extensively.
    About bigdata and bigtable:
    Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it’s difficult to process using traditional database and software techniques.
    While the term may seem to reference the volume of data, that isn’t always the case. The term big data — especially when used by vendors — may refer to the technology (which includes tools and processes) that an organization requires to handle the large amounts of data and storage facilities.
    The term big data is believed to have originated with Web search companies who had to query very large distributed aggregations of loosely-structured data.

    Bigtable:
    Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

    A few characteristics of BigTable:
    map :
    A map is an associative array; a data structure that allows one to quickly look up the value that corresponds to a key. BigTable is a collection of (key, value) pairs where the key identifies a row and the value is the set of columns.
    persistent :
    The data is stored persistently on disk.
    distributed :
    BigTable’s data is distributed among many independent machines. At Google, BigTable is built on top of GFS (Google File System). The Apache open source version of BigTable, HBase, is built on top of HDFS (Hadoop Distributed File System) or Amazon S3. The table is broken up among rows, with groups of adjacent rows managed by a server. A row itself is never distributed.
    sparse :
    The table is sparse, meaning that different rows in a table may use different columns, with many of the columns empty for a particular row.
    sorted :
    Most associative arrays are not sorted. A key is hashed to a position in a table. BigTable sorts its data by keys. This helps keep related data close together, usually on the same machine — assuming that one structures keys in such a way that sorting brings the data together. For example, if domain names are used as keys in a BigTable, it makes sense to store them in reverse order to ensure that related domains are close together.
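
    The reversed-domain trick is easy to see with a short example. The helper name row_key and the host names are made up for illustration; the point is only how the sort order changes.

```python
def row_key(hostname):
    """Reverse the dot-separated labels: 'maps.google.com' -> 'com.google.maps'."""
    return ".".join(reversed(hostname.split(".")))

hosts = ["maps.google.com", "mail.yahoo.com", "news.google.com", "www.yahoo.com"]

plain_order = sorted(hosts)
reversed_order = sorted(row_key(h) for h in hosts)

print(plain_order)      # hosts from the two domains are interleaved
print(reversed_order)   # hosts from the same domain sit next to each other
```

    With reversed keys, all rows for one domain fall into one contiguous key range, so they are likely to be stored together.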

    Advantages of BigTable-style stores over an RDBMS (reasons for choosing such an architecture):
    The main advantages are:
    • Join operations are less costly because of the denormalization
    • Replication/distribution of data is less costly because of data independence (i.e., if you distribute data across two nodes, you are unlikely to end up with an entity in one node and a related entity in another, because similar data is grouped together)
    These kinds of systems are suited to applications that need to scale out (i.e., you add more nodes to the system and performance increases proportionally). In an RDBMS like MySQL or Oracle, when you start adding more nodes, joining two tables that are not on the same node costs more. This becomes important when you are dealing with high volumes.
    RDBMSs are nice because of the richness of the storage model (tables, joins, foreign keys). Distributed databases are nice because of the ease of scaling.
    SOME FEATURES:
    • fast and extremely large-scale DBMS
    • a sparse, distributed multi-dimensional sorted map, sharing characteristics of both row-oriented and column-oriented databases.
    • designed to scale into the petabyte range
    • it works across hundreds or thousands of machines
    • it is easy to add more machines to the system and automatically start taking advantage of those resources without any reconfiguration
    • each table has multiple dimensions (one of which is a field for time, allowing versioning)
    • tables are optimized for GFS (Google File System) by being split into multiple tablets – segments of the table as split along a row chosen such that the tablet will be ~200 megabytes in size.

  6. leekris

    The architecture of a database system is influenced by the underlying computer system. Google uses a database system called Big Table. Its data is accessed through upper layers and applications. Big Table is a distributed storage system for structured data. It is proprietary to Google and built on Google systems and technologies. It is used for large-scale operations like search, email, and indexing.

    Big Table is not a relational database and is designed to scale to a very large size. Each table consists of rows and columns, and each cell has a timestamp. The row range is dynamically partitioned into tablets, i.e. contiguous sequences of rows. A tablet is roughly 200 MB, and each machine serves about 100 tablets. This setup allows tablets from a single table to be spread among many servers. Big Table also relies on Chubby, a highly reliable distributed lock service. The Big Table architecture consists of a master server, tablet servers, and lock servers.
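
    The tablet figures above give a feel for the scale involved. The arithmetic below is a back-of-the-envelope sketch using only those two rough numbers (about 200 MB per tablet, about 100 tablets per machine); the function names and the 1 TB table size are assumptions for the example.

```python
TABLET_SIZE_MB = 200       # rough tablet size quoted above
TABLETS_PER_SERVER = 100   # rough tablets per machine quoted above

def tablets_needed(table_size_mb):
    """Ceiling division: a partly filled tablet still counts as one tablet."""
    return -(-table_size_mb // TABLET_SIZE_MB)

def servers_needed(table_size_mb):
    """Machines required if each one serves about TABLETS_PER_SERVER tablets."""
    return -(-tablets_needed(table_size_mb) // TABLETS_PER_SERVER)

one_tb_mb = 1024 * 1024   # 1 TB expressed in MB
print(tablets_needed(one_tb_mb), "tablets on", servers_needed(one_tb_mb), "servers")
```

    Because tablets are small compared with the whole table, moving a few of them to a new machine is enough to rebalance the load.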

    This kind of architecture gives Google advantages such as a fast and extremely large-scale database management system that is easy to scale up to handle larger amounts of data. The tables also have multiple dimensions, which allows versioning, and are optimized for the Google File System. Big Table gives a flexible and high-performance solution for the varied demands of different Google projects like Google Earth, Google Finance, and others.

    Yahoo is reported to have the world’s largest database, 1 petabyte in a production environment, built in such a way that it can be scaled to tens of petabytes. The Yahoo database system is built from commodity Intel boxes strung together in huge clusters. Massively parallel systems are employed with distributed columnar storage. Yahoo uses PostgreSQL with a modified query processing layer designed for commodity hardware clusters.

    Big Data is a collection of data sets that are large and complex and cannot be handled with traditional database tools. Sizes are on the order of a few terabytes to petabytes. The data Yahoo collects is structured data, mostly about advertising on the website and how consumers navigate and use the site. The purpose of Big Data is to gain insights about consumers, give them the best experience, and make the business profitable.

    The kind of database system used by Yahoo makes data querying effective, especially for Big Data analytics queries. It also makes it possible to use advanced techniques for data compression, multi-level data and query partitioning, and vector query processing for efficient parallel processing.

  7. Niki

    Google uses Big Table (A Distributed Storage System for Structured Data) and Yahoo uses Apache Hadoop.

    Big Table:
    Big Table is a distributed storage system (built by Google) that is structured as a large table: one that may be petabytes in size and distributed among tens of thousands of machines. It is designed for storing items such as billions of URLs, with many versions per page; over 100 TB of satellite image data; hundreds of millions of users; and performing thousands of queries a second.

    Properties:
    • fast and extremely large-scale DBMS
    • a sparse, distributed multi-dimensional sorted map, sharing characteristics of both row-oriented and column-oriented databases.
    • it is easy to add more machines to the system and automatically start taking advantage of those resources without any reconfiguration
    • each table has multiple dimensions (one of which is a field for time, allowing versioning)
    • Tables are optimized for GFS (Google File System) by being split into multiple tablets – segments of the table as split along a row chosen such that the tablet will be ~200 megabytes in size.

    Big Data:

    Big data refers to large datasets that are challenging to store, search, share, visualize, and analyze. At first glance, the orders of magnitude outstrip conventional data processing and the largest of data warehouses. For example, an airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time. Compare that with conventional high-performance computing, where the New York Stock Exchange collects 1 terabyte of structured trading data per day. Compare again to a conventional structured corporate data warehouse that is sized in terabytes and petabytes.
    Big Data is sized in peta-, exa-, and soon perhaps, zetta-bytes! And, it’s not just about volume, the approach to analysis contends with data content and structure that cannot be anticipated or predicted.

    Storage and Management Capability : Hadoop Distributed File System (HDFS), Cloudera Manager. Processing Capability: MapReduce, Apache Hadoop.

  8. sthiaga1

    NoSQL is the database architecture used by both Google and Yahoo for their search engines. NoSQL databases are generally non-relational, distributed, open source, and horizontally scalable. They have become an increasingly important part of the database landscape and can offer real benefits.
    Big data, as the name suggests, is huge along four dimensions: Volume, Velocity, Variety, and Veracity. Big data is more than simply a matter of size; it is an opportunity to find insights in new types of data and content and to make database transactions more agile. Big data makes information more transparent, which encourages users to create and store more transactional data in digital form.
    Big table, which was developed by Google, is a distributed storage system for managing structured data that is designed to scale to very large sizes (petabytes). It is a sparse, distributed, persistent multi-dimensional sorted map indexed by a row key, a column key, and a timestamp. Big table is not a traditional relational database, and it does not offer full ACID transactions across rows. Big table uses the Google File System to store data and log files, and it depends on a cluster management system that schedules jobs and manages the machines in the cluster. Yahoo uses a similar architecture built on Apache Hadoop. Hadoop is written in Java and runs across platforms on top of a distributed file system. Because of these advantages in handling large sets of data, Google and Yahoo use these kinds of non-relational database architectures.
