PRIYANKA VERGADIA: Welcome to "Google Cloud Drawing Board," where we doodle our way through the cloud. Today's topic: what is Dataflow? This video is divided into chapters, so watch the full video or skip ahead to any section of your choice.

Data is generated in real time from websites, mobile apps, IoT devices, and other workloads. Capturing, processing, and analyzing this data is a priority for all businesses. But data from these systems is often not in a format that's conducive to analysis or to effective use by downstream systems. That's where Dataflow comes in. Dataflow is a serverless, fast, and cost-effective service that supports both streaming and batch processing. It provides portability, with processing jobs written using the open-source Apache Beam libraries. And it removes operational overhead from your data engineering teams by automating infrastructure provisioning and cluster management.

How does Dataflow work? In general, a data processing pipeline involves three steps: you read the data from a source, transform it, and write the data back into a sink. The data is read from the source into something called a PCollection. The P stands for parallel, because a PCollection is designed to be distributed across multiple machines. The pipeline then performs one or more operations on the PCollection, which are called transforms. Each time a transform runs, a new PCollection is created, because PCollections are immutable. After all of the transforms are executed, the pipeline writes the final PCollection to an external sink.

Once you've created your pipeline using the Apache Beam SDK in the language of your choice, Java or Python, you can use Dataflow to deploy and execute that pipeline, which is called a Dataflow job. Dataflow then assigns worker virtual machines to execute the data processing, and you can customize the shape and size of these machines. If your traffic pattern is spiky, Dataflow autoscaling automatically increases or decreases the number of worker instances required to run your job. Dataflow's Streaming Engine separates compute from storage and moves parts of pipeline execution out of the worker VMs and into the Dataflow service backend, which improves autoscaling and data latency.

Now, how do you use Dataflow? You can create Dataflow jobs using the Cloud Console UI, the gcloud command-line interface, or the API. You have options for creating the job: you can use Dataflow templates, write a SQL statement, or use AI Platform Notebooks. Dataflow templates offer a collection of prebuilt templates, with an option to create your own custom ones, which you can then easily share with others in your organization. Dataflow SQL lets you use your SQL skills to develop streaming pipelines right from the BigQuery web UI: you can join streaming data from Pub/Sub with files in Cloud Storage or tables in BigQuery, write results into BigQuery, and then build real-time dashboards for visualization. You can use AI Platform Notebooks from the Dataflow interface itself to build and deploy data pipelines using the latest data science and machine learning frameworks. And Dataflow inline monitoring lets you directly access job metrics to help with troubleshooting pipelines at both the step and the worker level.
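To make the source-transform-sink pipeline model from earlier concrete, here is a minimal word-count sketch using the Apache Beam Python SDK. The project, bucket, and region values are placeholders you'd replace with your own, and the transforms shown are just one example of what a pipeline might do.

    # A minimal Beam pipeline sketch. "my-project", "my-bucket", and
    # "us-central1" are placeholder values, not real resources.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",  # use "DirectRunner" to test locally first
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # Source: reads lines of text into the initial PCollection.
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
            # Transforms: each step produces a new, immutable PCollection.
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
            # Sink: writes the final PCollection back out to Cloud Storage.
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output")
        )

Running this script submits it as a Dataflow job; swapping in DirectRunner executes the same pipeline locally, which is a handy way to test before Dataflow spins up worker VMs.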
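And as a sketch of the Dataflow SQL path just mentioned, a streaming statement that joins a Pub/Sub topic with a BigQuery table might look like the following. The project, topic, dataset, and table names here are made up for illustration, and it assumes a schema has been assigned to the topic.

    -- Hypothetical names throughout: my-project, a "transactions" topic,
    -- and a sales_regions BigQuery lookup table.
    SELECT tr.state, tr.amount, sr.sales_region
    FROM pubsub.topic.`my-project`.transactions AS tr
    INNER JOIN bigquery.table.`my-project`.my_dataset.sales_regions AS sr
      ON tr.state = sr.state_code

The results of a statement like this can be written into BigQuery, where they can feed a real-time dashboard.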
How do you secure Dataflow pipelines? When using Dataflow, all the data is encrypted at rest and in transit. To secure the data processing environment, you can turn off public IPs to restrict access to internal systems, and you can leverage VPC Service Controls, which help you mitigate the risk of data exfiltration. Additionally, you can create a pipeline that's protected with customer-managed encryption keys.

Now, what does it cost to use Dataflow? Dataflow service usage is billed in per-second increments on a per-job basis, with rates that depend on whether the job processes streaming or batch data. For batch data processing, you can utilize the flexible resource scheduling feature, which reduces cost by using advanced scheduling techniques. Each Dataflow job uses at least one Dataflow worker, and the price depends on the worker configuration.

Where should you use Dataflow? Dataflow is a great choice for any batch or streaming data that needs processing and enrichment for downstream systems such as analysis, machine learning, or a data warehouse. Some examples: stream analytics enabling real-time business insights; real-time AI enabling predictive analytics, fraud detection, and personalization; processing streams of log data to unlock health insights for your systems; and, of course, any other data aggregation and analysis scenario.

Want to learn more about Dataflow? Check out cloud.google.com/dataflow.