One of the early stages in a data pipeline is data ingestion, which is where large amounts of streaming data are received. Data, however, may not always come from a single structured database. Instead, the data might stream from a thousand or even a million different events that are all happening asynchronously. A common example of this is data from Internet of Things (IoT) applications. These can include sensors on taxis that send out location data every 30 seconds, or temperature sensors around a data center that help optimize heating and cooling.

These IoT devices present new challenges to data ingestion, which can be summarized in four points. The first is that data can be streamed from many different methods and devices, many of which might not talk to each other and might be sending bad or delayed data. The second is that it can be hard to distribute event messages, or notifications, to the right subscribers. A method is needed to collect the streaming messages that come from IoT sensors and broadcast them to the subscribers as needed. The third is that data can arrive quickly and at high volumes, and services must be able to support this. And the fourth challenge is ensuring services are reliable, secure, and perform as expected.

Google Cloud has a tool to handle distributed message-oriented architectures at scale, and that's Pub/Sub. The name is short for publisher/subscriber, or publish messages to subscribers. Pub/Sub is a distributed messaging service that can receive messages from a variety of device streams, such as gaming events, IoT devices, and application streams. It ensures at-least-once delivery of received messages to subscribing applications, with no provisioning required. Pub/Sub's APIs are open, the service is global by default, and it offers end-to-end encryption.

Let's explore the end-to-end Pub/Sub architecture. Upstream source data comes in from devices all over the globe and is ingested into Pub/Sub, which is the first point of contact within the system. Pub/Sub reads, stores, and broadcasts to any subscribers of a data topic that new messages are available. As a subscriber of Pub/Sub, Dataflow can ingest and transform those messages in an elastic streaming pipeline and output the results into an analytics data warehouse like BigQuery. Finally, you can connect a data visualization tool, like Looker or Data Studio, to visualize and monitor the results of a pipeline, or AI and ML tools, such as Vertex AI, to explore the data to uncover business insights or help with predictions.

A central element of Pub/Sub is the topic. You can think of a topic like a radio antenna. Whether your radio is playing music or it's turned off, the antenna itself is always there. If music is being broadcast on a frequency that nobody is listening to, the stream of music still exists. Similarly, a publisher can send data to a topic that has no subscriber to receive it. Or a subscriber can be waiting for data from a topic that isn't getting data sent to it, like listening to static from a bad radio frequency. Or you could have a fully operational pipeline, where the publisher is sending data to a topic that an application is subscribed to. That means there can be zero, one, or more publishers, and zero, one, or more subscribers, related to a topic. And they're completely decoupled, so they're free to break without affecting their counterparts.
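To make that decoupling concrete, here is a minimal sketch of the publishing side using the google-cloud-pubsub Python client. The project and topic IDs are placeholders, and the topic is assumed to already exist; the point is that the publish call looks exactly the same whether the topic has zero subscribers or a thousand.

```python
# pip install google-cloud-pubsub
from google.cloud import pubsub_v1

# Placeholder IDs; assume the topic was created beforehand,
# e.g. with: gcloud pubsub topics create my-topic
project_id = "my-project"
topic_id = "my-topic"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# Publishing succeeds whether zero, one, or many subscriptions exist on the
# topic -- the publisher never needs to know who (if anyone) is listening.
future = publisher.publish(
    topic_path,
    data=b"sensor reading: 21.5C",  # the message payload must be bytes
    device_id="sensor-042",         # optional attributes travel as metadata
)
print(f"Published message ID: {future.result()}")
```

The publish call returns a future that resolves to a message ID once Pub/Sub has stored the message, which is what makes the antenna analogy work: the broadcast happens regardless of who is tuned in.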
It's helpful to describe this using an example. Say you've got a human resources topic. A new employee joins your company, and several applications across the company need to be updated. Adding a new employee can be an event that generates a notification to the other applications that are subscribed to the topic, and they'll receive the message about the new employee starting. Now, let's assume that there are two different types of employees: a full-time employee and a contractor. Both sources of employee data could have no knowledge of the other, but still publish their events saying "this employee joined" into the Pub/Sub HR topic. After Pub/Sub receives a message, downstream applications like the directory service, facilities system, account provisioning, and badge activation systems can all listen and process their own next steps independently of one another, as sketched below.

Pub/Sub is a good solution to buffer changes for loosely coupled architectures like this one, which have many different sources and sinks. Pub/Sub supports many different inputs and outputs, and you can even publish a Pub/Sub event from one topic to another. The next task is to get these messages reliably into our data warehouse, and we'll need a pipeline that can match Pub/Sub's scale and elasticity to do it.
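Before moving on to that pipeline, here is a minimal sketch of one of those downstream subscribers, again using the google-cloud-pubsub Python client. The project, topic, and subscription names are hypothetical, and the subscription is assumed to already exist.

```python
# pip install google-cloud-pubsub
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

# Placeholder IDs. Each downstream system (directory service, badge
# activation, and so on) would have its own subscription on the same topic,
# so each one receives its own copy of every "employee joined" message,
# e.g. created with:
#   gcloud pubsub subscriptions create badge-activation-sub --topic=hr-topic
project_id = "my-project"
subscription_id = "badge-activation-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Process the event, then ack it so Pub/Sub stops redelivering it.
    # At-least-once delivery means this handler should tolerate duplicates.
    print(f"Activating badge for: {message.data.decode('utf-8')}")
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print(f"Listening for messages on {subscription_path}...")

with subscriber:
    try:
        # Block the main thread; messages are handled on background threads.
        streaming_pull_future.result(timeout=30)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()  # wait for the shutdown to complete
```

Because each subscription receives its own copy of every message, the badge activation system and the directory service never have to coordinate with each other, which is exactly the decoupling the topic provides.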