SPEAKER: Welcome back to another episode of "BigQuery Spotlight." Today, we're talking about BigQuery data governance, so that everyone in your organization can easily find and leverage the data they need to make effective decisions, while minimizing overall risk and ensuring regulatory compliance. Let's jump in.

[MUSIC PLAYING]

The term data governance is thrown around quite a bit, but what does it actually mean? Data governance is everything you do to ensure your data is secure, private, accurate, available, and usable inside of BigQuery.

For us, governance starts during data onboarding. Let's say your team has received a new request. Someone from your e-commerce department wants to add transactions data into a new BigQuery table. The first step in ensuring governance is understanding the data: for example, figuring out where the data is coming from and what information it contains. In this case, the transactions will be extracted from a database, and you already know the schema. So let's start by creating a new blank BigQuery table.

Next, you'll need to make sure that sensitive data is protected. To do so, you'll want to categorize the information in your new table, for example, marking specific columns as having personally identifiable information. You might remember that tables can have labels. While labels are a great way to easily categorize BigQuery objects, they are a bit limited and can't be used for specific columns.

Lucky for us, BigQuery has deep integrations with Data Catalog, Google Cloud's data discovery and metadata management service. Data Catalog automatically tracks technical metadata for BigQuery assets, so we can already find our new BigQuery table and see things like column names, descriptions, and when it was created. But the really cool part is that you can create schematized tags that act as annotations to capture metadata. We've already created a tag template that tracks data governance information, so now let's fill it in and attach the tag to the Email column. Great. Now we're tracking that this entity contains customer
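The idea of a schematized governance tag can be sketched without touching the Data Catalog API at all. The sketch below models a transactions schema and a per-column governance tag as plain dicts; the schema, template ID, and field names are all made-up examples, and a real setup would use the google-cloud-datacatalog client instead.

```python
# Hypothetical transactions schema for the new BigQuery table.
transactions_schema = [
    {"name": "transaction_id", "type": "STRING"},
    {"name": "email", "type": "STRING"},
    {"name": "comments", "type": "STRING"},
    {"name": "amount", "type": "NUMERIC"},
]

# One governance tag per annotated column, mirroring a schematized
# Data Catalog tag template (the template ID and fields are made up).
governance_tags = {
    "email": {"template": "data_governance", "has_pii": True,
              "pii_type": "EMAIL"},
}

def columns_with_pii(schema, tags):
    """Return column names whose governance tag flags PII."""
    return [col["name"] for col in schema
            if tags.get(col["name"], {}).get("has_pii")]

print(columns_with_pii(transactions_schema, governance_tags))  # ['email']
```

Keeping the tag separate from the table schema is the point: the annotation lives in the catalog, not in the table definition itself.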
email addresses.

But you might also notice there's a Comments column that stores freeform text customers have added to their order. There's a chance that some of these rows contain more sensitive information. This is where DLP comes in. DLP, or Cloud Data Loss Prevention, is Google Cloud's service designed to help you discover and protect your most sensitive data. With DLP, you can create an inspection job that scans comments for email addresses. DLP can scan data that lives in Datastore, Google Cloud Storage, or BigQuery, and can send you an email with the scan results, save the results to a table inside of BigQuery, or even write the results directly into Data Catalog.

Many BigQuery users choose to leverage an orchestration tool, like Cloud Composer, to build data onboarding pipelines. For example, you might load your transactions data into a locked-down staging project inside of BigQuery and kick off a DLP scan when new comments are added. The results will be listed in the staging table's metadata inside of Data Catalog. If the scan shows that any sensitive information is included in a comment, then you can also kick off a de-identification workflow, where DLP replaces PII with something like an asterisk or a hash symbol. Then, the new data is finally appended to the production BigQuery table.

Now that you've de-identified and classified your data, you're ready to design access policies. To understand data access in BigQuery, you need to familiarize yourself with Identity and Access Management, or IAM. If you haven't already watched our video on BigQuery IAM, I recommend pausing and taking a look. We've linked it in the comments below.

Great, so on to data sharing. When it comes to granting access to BigQuery data, many administrators choose to grant Google Groups representing your company's different departments access to specific datasets or projects, so policies are simple to manage. Say you have a project that stores your organization's e-commerce data. You might give the e-commerce team the BigQuery Data Viewer role in this project. Now, if
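The de-identification step can be imitated locally to see what it does to a row. The sketch below masks email addresses in freeform comments with asterisks; the regex is a rough stand-in for DLP's EMAIL_ADDRESS infoType detector, and the sample comments are invented, so treat this as an illustration of the transformation, not of the DLP API.

```python
import re

# Rough stand-in for DLP's EMAIL_ADDRESS detector; the real service's
# inspection is far more robust than this simple regex.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_emails(comment: str) -> str:
    """Replace each detected email with asterisks of the same length."""
    return EMAIL_RE.sub(lambda m: "*" * len(m.group()), comment)

# Hypothetical staging-table comments awaiting de-identification.
staged_comments = [
    "Great shoes, shipped fast!",
    "Contact me at jane@example.com about my order",
]

# De-identified rows, ready to append to the production table.
clean = [mask_emails(c) for c in staged_comments]
print(clean[1])  # "Contact me at **************** about my order"
```

In the pipeline described above, this masking would run only on rows the DLP inspection job flagged, before the append to production.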
someone on the e-commerce team needs data from a different team, like the product development department, you may be inclined to just grant them access to that dataset. Instead, you could use an authorized view, or an authorized UDF, so that your e-commerce analysts can get limited access to product data while still keeping more manageable team-to-project permissions.

So this all sounds great, but what about those PII columns that we flagged before in Data Catalog? You definitely don't want the e-commerce team to have access to those, unless they have special clearance from security. With BigQuery, you can control access to specific columns using policy tags from Data Catalog. Here, we've created a taxonomy, which classifies the level of clearance needed to access different types of data. Anyone who needs access to our highly sensitive data should be added as a Fine-Grained Reader on the High resource. So in our case, we'll just add the e-commerce PII clearance group. Next, we can apply these as policy tags to columns in our BigQuery table, marking the columns with PII as highly sensitive.

And to take it one step further, maybe you don't want all e-commerce folks to get access to the transactions data. Instead, each analyst should only be able to view transactions for the product category they work on. You can control access to specific rows in BigQuery using a row-level access policy, like the one we're showing here.

Between storing data in different containers and leveraging authorized views, UDFs, policy tags, and row-level access policies, you have a lot of options for sharing data in a secure manner. Check out the documentation below for some more guidance on pros and cons and general best practices for access policies.

OK, so we've onboarded our dataset and we've shared it. The last step we'll talk about is ongoing monitoring, including ensuring that data is accurate. This is where data quality comes in. There are lots of different tools that allow you to declare quality validation tests by writing custom assertions using SQL. For example, you may want to check
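A row-level access policy like the one described is plain DDL, so it can be sketched by just assembling the statement. The policy name, table, group, and filter column below are all hypothetical examples; the FILTER USING expression is what decides which rows each grantee can see.

```python
def row_access_policy_ddl(policy, table, member, filter_expr):
    """Build a BigQuery CREATE ROW ACCESS POLICY statement.

    All identifiers passed in are examples for illustration; nothing
    here is executed against BigQuery.
    """
    return (
        f"CREATE ROW ACCESS POLICY {policy}\n"
        f"ON `{table}`\n"
        f"GRANT TO ('{member}')\n"
        f"FILTER USING ({filter_expr});"
    )

# Hypothetical policy: apparel analysts see only apparel transactions.
ddl = row_access_policy_ddl(
    policy="apparel_only",
    table="my-project.ecommerce.transactions",
    member="group:apparel-analysts@example.com",
    filter_expr="product_category = 'apparel'",
)
print(ddl)
```

One such policy per product category, each granted to that category's analyst group, gives the per-analyst visibility described above without duplicating the table.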
and make sure that you don't have any duplicate entries in your table. We've actually created an open-source framework for data quality validation that's included in the comments below. And keep an eye out for Dataform, our tool for developing SQL pipelines, which will be open to new customers soon.

Just like we mentioned before, you can use an orchestration tool like Cloud Composer to programmatically run these tests and alert you if something seems wrong. You can also use the Data Catalog API to make sure that quality metadata is added to tags and is easily discoverable.

Aside from monitoring quality, you might also want to monitor who's accessing the data and how they're using it. Well, you're in luck. In a few weeks, we're diving into monitoring your BigQuery deployment, so be sure to tune in. And remember, stay curious.

[MUSIC PLAYING]
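The duplicate-entry check mentioned above boils down to a GROUP BY ... HAVING COUNT(*) > 1 assertion: the test passes when the query returns zero rows. The sketch below builds that query and runs the same logic locally on a few sample rows; the table and key names are assumptions for illustration.

```python
from collections import Counter

def duplicate_check_sql(table: str, key: str) -> str:
    """SQL assertion: the check passes when this query returns no rows."""
    return (
        f"SELECT {key}, COUNT(*) AS n\n"
        f"FROM `{table}`\n"
        f"GROUP BY {key}\n"
        f"HAVING COUNT(*) > 1"
    )

def find_duplicates(rows, key):
    """Local equivalent of the SQL assertion, for quick sanity tests."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

# Hypothetical sample of staged transactions.
rows = [
    {"transaction_id": "t1"},
    {"transaction_id": "t2"},
    {"transaction_id": "t1"},  # duplicate entry
]
print(find_duplicates(rows, "transaction_id"))  # ['t1']
```

An orchestrator like Cloud Composer could run the generated query on a schedule and alert when it returns any rows at all.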