Cloud Data Lake

What is it?

The Cloud Data Lake platform lands data directly from Open Ingest. In scenarios where Open Ingest cannot automatically create a subject in the Data Foundry, users can onboard data through Cloud Data Lake instead.

Cloud Data Lake also helps you understand what data exists in the Data Lake by crawling, cataloging, and indexing it.

The CPP-compliant Cloud Data Lake is a designated data repository for legal hold and can support sensitive non-public data.

Who is it for?

The Cloud Data Lake platform is designed for teams that:

  • Need to ingest and store large volumes of data from various sources, including those not directly supported by Open Ingest.
  • Require a centralized data repository for cataloging, indexing, and understanding the contents of their data.
  • Handle sensitive non-public data and need a compliant solution for legal hold requirements.
  • Value the flexibility and scalability of a cloud-based data lake platform.

How to use it?

Onboarding Process

  1. Submit a Jira request for a new subject.
  2. New Partners, Domains, and Subdomains can be provisioned as needed if the current hierarchy does not fit the new dataset. Requests for new partners require a ticket and can be submitted here.
    • Domain and subdomain names must not contain organizational acronyms, because organization names change. The name must reflect the business function rather than the organization that currently delivers that function.
    • Domain names must be between 5 and 50 characters in length.
    • Subdomain names must be between 3 and 50 characters in length. Only lowercase alphabetical, numerical, and underscore characters are allowed. (A pre-submission validation sketch follows this list.)
    • Both domains and subdomains require a description that explains what the data category represents in terms that external teams can understand.
  3. If your data ingest is less than 100 GB per day, no additional cost estimation is required.
  4. For Base, Semantic, and Direct to PA subjects, teams must adhere to the requirements of a dx Cloud Data Lake Custom Data Producer.
  5. If the request requires additional cost estimation, the ticket will be routed to the dx TCO team for cost analysis and approval.
  6. If the source data format is JSON, then a JSON schema must be uploaded to the Data Stream Builder and “Validate for AWS Data Lake” must be selected. See additional JSON Schema information below, and the local validation sketch after this list.
  7. Once data is ingested into Cloud Data Lake, you, as the requestor, become the owner of the data by default.
    • If an alternative person should own the dataset, please specify this when you create your ticket.
  8. Data owners must specify a retention policy that is compliant with Comcast Records & Information Management Retention Schedule Records Series.
  9. Data owners are required to curate PI elements found in their data using the MyCPP portal.
  10. A valid application ID must be specified; you can create a new one in DevHub.
  11. As privacy requirements evolve, data owners may be required to support Data Management Remediation retention, IRR download, and IRR deletion. This could result in mandatory schema changes, such as separating PI elements from transactional attributes, providing cross-reference data, and supporting SQL.
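
Because domain and subdomain names are checked during ticket review, it can save a round trip to pre-check candidate names before submitting the Jira request in step 2. The Python sketch below encodes the naming rules listed above; note that the source states the character restriction for subdomains, so applying it to domains as well is an assumption, and the ORG_ACRONYMS blocklist is purely hypothetical (the acronym rule is ultimately a human judgment during review).

    import re

    # Allowed characters: lowercase letters, digits, underscores.
    # Stated for subdomains; applying it to domains too is an assumption.
    NAME_PATTERN = re.compile(r"^[a-z0-9_]+$")

    # Hypothetical blocklist; the real acronym check is done by reviewers.
    ORG_ACRONYMS = {"abc", "xyz"}

    def validate_name(name: str, min_len: int, max_len: int) -> list[str]:
        """Return a list of rule violations; an empty list means the name passes."""
        problems = []
        if not (min_len <= len(name) <= max_len):
            problems.append(f"length must be {min_len}-{max_len} characters")
        if not NAME_PATTERN.match(name):
            problems.append("only lowercase letters, digits, and underscores are allowed")
        if any(part in ORG_ACRONYMS for part in name.split("_")):
            problems.append("name should reflect the business function, not an organization")
        return problems

    def validate_domain(name: str) -> list[str]:
        return validate_name(name, 5, 50)    # domains: 5-50 characters

    def validate_subdomain(name: str) -> list[str]:
        return validate_name(name, 3, 50)    # subdomains: 3-50 characters

    print(validate_domain("billing_events"))    # [] -> valid
    print(validate_subdomain("Invoice-Data"))   # flags the character rule

This is only a local convenience check; the portal and reviewers remain the source of truth for naming.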
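
For step 6, it can also help to validate a few sample records against your JSON schema locally before uploading it to the Data Stream Builder. The sketch below uses the open-source jsonschema package; the field names and the draft-07 dialect are illustrative assumptions, not Data Stream Builder requirements, and selecting “Validate for AWS Data Lake” still happens in the portal itself.

    from jsonschema import validate, ValidationError  # pip install jsonschema

    # Minimal, hypothetical schema for an event feed; adjust to your dataset.
    schema = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "event_id": {"type": "string"},
            "event_ts": {"type": "string"},
            "payload": {"type": "object"},
        },
        "required": ["event_id", "event_ts"],
    }

    # A sample record drawn from the source data.
    sample_record = {
        "event_id": "abc-123",
        "event_ts": "2024-01-15T12:00:00Z",
        "payload": {},
    }

    try:
        validate(instance=sample_record, schema=schema)
        print("record conforms; schema is ready to upload to the Data Stream Builder")
    except ValidationError as err:
        print(f"schema violation: {err.message}")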