Multimodal Data Survey Results + Insights from Guidepoint Interviews

Key Results [BiotechX]:

142 respondents from biopharma [biotech, pharma, CROs, consultants]
84.5% [115] consider the use of multimodal data in R&D strategy both urgently needed and important (score of 5 and up)

1. 8/10 median score for importance
2. 7/10 median score for urgency of use

The following results (3-7) are computed on the 115 respondents who consider multimodal data both urgent and important.
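As a minimal sketch of how such a subset could be derived, the snippet below keeps respondents who score both importance and urgency at 5 or higher and reports their share. The scores are toy values, not the survey data, and the field names are hypothetical. Computing the medians over all respondents rather than over the subset is also an assumption.

```python
from statistics import median

# Toy responses (NOT the survey data): scores on a 1-10 scale.
responses = [
    {"importance": 8, "urgency": 7},
    {"importance": 9, "urgency": 4},
    {"importance": 5, "urgency": 5},
    {"importance": 3, "urgency": 8},
    {"importance": 10, "urgency": 9},
]

THRESHOLD = 5  # "score of 5 and up"


def urgent_and_important(rows, threshold=THRESHOLD):
    """Keep respondents at or above the threshold on both questions."""
    return [
        r for r in rows
        if r["importance"] >= threshold and r["urgency"] >= threshold
    ]


subset = urgent_and_important(responses)
share = 100 * len(subset) / len(responses)

# Assumption: medians are taken over all respondents, not just the subset.
median_importance = median(r["importance"] for r in responses)
median_urgency = median(r["urgency"] for r in responses)
```

Later percentages would then use len(subset) as the denominator instead of the full respondent count.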

How well are current solutions serving them?

1. 88% are dissatisfied with how governance and compliance are currently handled for their data and report major issues with current solutions
2. 90.5% say it's very hard or moderately hard to store and catalog different data modalities side by side, e.g. -omics data, assays, clinical trial data, imaging, literature

Data types that are most challenging to handle

1. Omics data - 46.9%
2. Images - 38.2%

Specific top challenges:

1. Multimodal data is very complex to integrate and needs a lot of engineering resources - 42.6%
2. Data is high volume and we are unable to move it around easily for collaboration - 26%
3. It is hard to track all our data and analyses to confidently look back for reproducibility and audit compliance - 26%

Top Must-Haves in a multimodal platform:

1. Work with data of all kinds including large scale omics, assays, clinical trials, literature, and biomedical images - 52.1%
2. Have in-built capabilities for data science and machine learning - 31.7%
3. Be compliant with enterprise-grade security features - 30.3%
4. Be able to connect all data assets for a project in a single space, such as omics data, assays, literature, ancillary measurements, and reports that team members exchange with each other - 30.3%

Top Nice-to-Haves:

1. Work as a central place for cataloging all relevant data for R&D - 23.2%
2. Have in-built capabilities for data science and machine learning - 22.5%
3. Connect with major lab instruments to store raw data directly in one place - 19.7%

Synthesis of key insights from one-on-one interviews with pharma/biotech R&D leaders, focusing on common themes and notable differences.

Key Cross-Cutting Themes:

1. Multi-Omics Data Integration & Complexity

- Virtually all companies handle multiple types of omics data (genomics, proteomics, transcriptomics, spatial)

- Integration of different data types is a universal challenge

- Data volume and complexity require significant computational resources

- Proportion of multi-modal data varies by therapeutic area (70-80% in oncology, 30-50% in other areas)

2. Organizational Structure & Collaboration

- Mix of centralized and embedded models for data science/IT teams

- Close collaboration between computational teams and wet lab scientists

- Bioinformatics teams often act as bottlenecks due to high demand

- Trend toward cross-functional teams for drug discovery programs

3. Data Infrastructure

- Most use cloud-based storage (e.g., AWS S3)

- No single unified data platform across organizations

- Strong emphasis on data governance and security

- Mix of internal and external data sources (clinical trials, public databases, partnerships)

4. Analysis Tools & Workflows

- Heavy reliance on both commercial and open-source tools

- Common use of Python/R for analysis

- Growing interest in natural language interfaces

- Need for tools that support both power users and non-technical scientists

Unique Approaches & Differences:

1. Company-Specific Focus

- Kite Pharma: Specialized in CAR-T therapy, heavy focus on immune cell analysis

- AstraZeneca: Strong emphasis on oncology and spatial biology

- Sanofi: Moving away from dashboards to dedicated UIs

- Various approaches to building vs. buying solutions

2. Tool Selection

- Some companies heavily invest in internal platforms

- Others prefer vendor solutions with customization

- Varying levels of adoption of advanced technologies (spatial omics, AI)

Common Pain Points:

1. Data Management

- Difficulty in harmonizing data from different sources

- Challenges in data findability and accessibility

- Need for better data curation and standardization

- Complex compliance and security requirements

2. Analysis Bottlenecks

- Overloaded bioinformatics teams

- Long wait times for analysis requests

- Need for more self-service tools for scientists

- Challenge of maintaining analysis reproducibility

3. Integration Challenges

- Lack of standardization across data types

- Difficulty in connecting multiple data modalities

- Need for better visualization tools

- Challenges in managing data access and permissions

Future Needs:

1. Better Tools

- Natural language interfaces for data exploration

- Improved visualization capabilities

- Tools that support both basic and advanced users

- Better integration of multiple data types

2. Infrastructure

- More scalable storage solutions

- Better data governance tools

- Improved data sharing capabilities

- More efficient processing pipelines

3. Collaboration

- Tools that facilitate better team communication

- Improved sharing of analysis results

- Better documentation and reproducibility

- More efficient workflow management

Designing a comprehensive data platform based on insights from these interviews

Key Features and Components:

1. Data Ingestion & Processing

- Flexible ingestion pipelines supporting multiple data types (genomics, proteomics, imaging)

- Automated quality control and validation

- Standardized data processing workflows

- Support for both batch and streaming data

- Data harmonization and integration capabilities
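To make the automated quality control and validation step concrete, here is a minimal sketch of a metadata check run at ingestion time. The required fields per data type and all names are hypothetical and chosen only for illustration. A real pipeline would also validate file contents, not just metadata.

```python
from dataclasses import dataclass, field

# Hypothetical required metadata per data type; a real schema would be richer.
REQUIRED_FIELDS = {
    "genomics": {"sample_id", "assay", "reference_genome"},
    "proteomics": {"sample_id", "assay", "instrument"},
    "imaging": {"sample_id", "modality", "pixel_size_um"},
}


@dataclass
class QCResult:
    accepted: bool
    problems: list = field(default_factory=list)


def validate_record(data_type, metadata):
    """Check one incoming record against the required fields for its type."""
    if data_type not in REQUIRED_FIELDS:
        return QCResult(False, [f"unsupported data type: {data_type}"])
    missing = sorted(REQUIRED_FIELDS[data_type] - set(metadata))
    problems = [f"missing field: {name}" for name in missing]
    return QCResult(accepted=not problems, problems=problems)
```

Rejected records can be routed to a review queue together with their problems list, so that curation work is visible instead of silently dropped.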

2. Storage & Organization

- Hierarchical storage architecture (hot/warm/cold data)

- Support for structured and unstructured data

- Automated metadata generation and management

- Version control for datasets

- Efficient handling of large-scale omics data
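The hot/warm/cold idea can be sketched as a simple age-based rule. The 30-day and 180-day thresholds below are assumptions for illustration, not values from the interviews. In practice the cut-offs would be tuned to access patterns and storage cost.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; real values depend on access patterns and cost.
HOT_WINDOW = timedelta(days=30)
WARM_WINDOW = timedelta(days=180)


def storage_tier(last_accessed, now):
    """Assign a tier from the time elapsed since the dataset was last read."""
    age = now - last_accessed
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"
```

A scheduled job could apply this rule to the catalog and move datasets between storage classes accordingly.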

3. Access Control & Security

- Role-based access control (RBAC)

- Granular permissions at dataset level

- Audit logging and tracking

- Compliance with regulatory requirements (GDPR, HIPAA)

- Data encryption at rest and in transit
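As a rough illustration of how role-based access control, dataset-level permissions, and audit logging fit together, the sketch below checks a request against a role table and records every decision. The roles, users, and dataset names are hypothetical. It is a toy in-memory model, not a production authorization system.

```python
from datetime import datetime, timezone

# Hypothetical roles and the actions each one may perform.
ROLE_ACTIONS = {
    "viewer": {"read"},
    "analyst": {"read", "run_analysis"},
    "admin": {"read", "run_analysis", "write", "grant"},
}

# Dataset-level grants: (user, dataset) -> role. Names are illustrative.
grants = {
    ("alice", "rnaseq_study_01"): "analyst",
    ("bob", "rnaseq_study_01"): "viewer",
}

audit_log = []


def is_allowed(user, dataset, action):
    """Check a dataset-level permission and record the decision."""
    role = grants.get((user, dataset))
    allowed = role is not None and action in ROLE_ACTIONS[role]
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "action": action,
        "allowed": allowed,
    })
    return allowed
```

Logging denied requests as well as granted ones is a deliberate choice here, since audits usually need both.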

4. User Interfaces

a. Natural Language Interface

- Query construction from natural language

- Context-aware query suggestions

- Support for complex biological questions

- Integration with domain knowledge bases

b. Visual Interface

- Interactive dashboards

- Custom visualization templates

- Drag-and-drop analysis tools

- Collaboration features

c. Code Interface

- Support for Python/R

- Jupyter notebook integration

- API access

- Pipeline development tools

5. Analysis Capabilities

- Standard analysis workflows (differential expression, pathway analysis)

- Custom pipeline development

- Integration with existing tools (BioTuring, Rosalind, etc.)

- Machine learning model development and deployment

- Result caching and sharing

6. Governance & Management

- Data lineage tracking

- Usage analytics

- Resource monitoring

- Cost management

- Automated backup and archival
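Data lineage tracking can be pictured as a graph that links each derived artifact to its inputs. The sketch below is one possible minimal form, assuming artifacts are identified by a truncated content hash. The function names and step labels are hypothetical.

```python
import hashlib

# artifact id -> {"step": name of the producing step, "inputs": parent ids}
lineage = {}


def artifact_id(content: bytes) -> str:
    """Derive a short identifier from the artifact's content."""
    return hashlib.sha256(content).hexdigest()[:12]


def record(content, step, inputs=()):
    """Register an artifact together with the step and inputs that produced it."""
    aid = artifact_id(content)
    lineage[aid] = {"step": step, "inputs": list(inputs)}
    return aid


def upstream(aid):
    """Return all ancestors of an artifact, nearest first (breadth-first)."""
    seen, queue = [], list(lineage[aid]["inputs"])
    while queue:
        parent = queue.pop(0)
        if parent not in seen:
            seen.append(parent)
            queue.extend(lineage[parent]["inputs"])
    return seen
```

Walking upstream from a result answers the reproducibility question of which raw data and processing steps it depends on.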

7. Collaboration Features

- Shared workspaces

- Analysis sharing and reproduction

- Comment and annotation tools

- Version control for analyses

- Export to presentation formats

8. Integration Capabilities

- APIs for external tool integration

- Support for common data formats

- Plugin architecture for extensibility

- Connection to public databases

- Integration with existing infrastructure

9. Performance Features

- Distributed computing support

- Caching layer for frequent queries

- Optimized storage for different data types

- Automated scaling based on demand

- Performance monitoring and optimization

10. Documentation & Support

- Interactive tutorials

- Context-sensitive help

- Best practices guides

- API documentation

- Training materials

Implementation Considerations

1. Modularity

- Microservices architecture for flexibility

- Clear separation of concerns

- Pluggable components

- Easy updates and maintenance

2. Scalability

- Horizontal scaling capability

- Cloud-native design

- Resource optimization

- Performance monitoring

3. Adoption Strategy

- Phased rollout approach

- Integration with existing workflows

- User training program

- Feedback collection and iteration

4. Cost Management

- Usage monitoring

- Resource optimization

- Cost allocation tracking

- Budget management tools

This platform design addresses the key pain points identified in the interviews while providing the flexibility needed for different types of users and use cases. The modular architecture allows for gradual adoption and integration with existing tools and workflows, which was a common concern across organizations.

Competitive Insights

Based on the interviews and my knowledge of the field, there isn't a single platform that encompasses all these features, but several platforms offer partial solutions:

1. Terra.bio (Broad Institute/Verily)

- Strengths:

* Strong workflow management for genomics

* Good integration with cloud infrastructure

* Collaborative notebooks

* Well-established in genomics community

- Limitations:

* Primarily focused on genomics

* Less support for other omics types

* Limited natural language capabilities

2. Snowflake + Partner Ecosystem

- Strengths:

* Excellent data warehousing capabilities

* Strong security and governance

* Good integration capabilities

- Limitations:

* Not life sciences specific

* Requires significant customization

* Limited support for unstructured data

3. Databricks + Unity Catalog

- Strengths:

* Strong analytics capabilities

* Good support for ML/AI workflows

* Scalable compute

- Limitations:

* Not specialized for biotech/pharma

* Requires extensive customization

* Complex for non-technical users

4. Seven Bridges Platform

- Strengths:

* Good support for genomics workflows

* Strong compliance features

* Cloud-agnostic

- Limitations:

* Limited support for other omics

* Less flexible for custom workflows

* More focused on pipeline execution

5. DNAnexus

- Strengths:

* Strong security and compliance

* Good genomics support

* Used by the FDA (powers the precisionFDA platform)

- Limitations:

* Limited multi-omics capabilities

* Less flexible for custom applications

* More focused on specific workflows

Note: these platforms are SaaS offerings, whereas organizations want self-hosted deployments.

The gap between current solutions and the ideal platform exists in several areas:

1. Multi-omics Integration

- Most platforms excel in one data type but struggle with true multi-modal integration

- Limited support for spatial and single-cell data

- Difficulty handling complex relationships between data types

2. User Interface Flexibility

- Few platforms successfully bridge the technical/non-technical user divide

- Limited natural language capabilities

- Complex setup and configuration requirements

3. Analysis Democratization

- Most require significant technical expertise

- Limited self-service capabilities for scientists

- Complex deployment and maintenance

4. Data Harmonization

- Limited automated data harmonization

- Difficulty integrating external and internal data

- Inconsistent metadata handling

5. Visualization and Exploration

- Limited interactive visualization capabilities

- Difficulty handling large-scale data exploration

- Poor support for hypothesis generation

The closest approach currently used by many organizations is a custom stack combining:

- Cloud infrastructure (AWS/Azure/GCP)

- Data warehouse (Snowflake/Databricks)

- Analysis platforms (Terra/DNAnexus)

- Custom interfaces and tools

- Internal data management systems

This suggests there's still a significant opportunity for a platform that can better integrate these capabilities while addressing the specific needs of biotech/pharma organizations. The challenge isn't just technical - it's about creating a solution that can be readily adopted within existing organizational structures and workflows while providing clear value to both technical and non-technical users.

Based on the interviews and information provided about TileDB, here's a comparative analysis:

TileDB's Key Differentiators:

1. Data Model & Storage

- Unique multi-dimensional array storage model well-suited for omics data

- Native support for sparse and dense arrays

- Better handling of complex data types compared to traditional databases

- Direct S3 bucket integration mentioned as valuable by interviewees

2. Flexibility Advantages

- Language-agnostic interface (valued by multiple interviewees)

- Support for both structured and unstructured data

- Good handling of spatial data (mentioned as important by AstraZeneca and Sanofi)

- Ability to handle single-cell data effectively

3. Integration Capabilities

- Mentioned support for key data types: single-cell, bioimaging, notebooks/dashboards

- Chan Zuckerberg single-cell atlas support (noted by Sanofi as impressive)

- Appears more focused on life sciences than general-purpose platforms
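The distinction between dense and sparse arrays noted under Data Model & Storage can be illustrated in plain Python. This is a conceptual sketch only and does not use or mirror TileDB's API. The helper names are hypothetical.

```python
# Dense: every cell is stored, e.g. a small image tile (rows x columns).
dense_tile = [
    [0, 12, 0],
    [7, 0, 0],
]


def to_sparse(matrix):
    """Keep only non-zero cells, keyed by (row, column) coordinates."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }


def slice_sparse(cells, rows, cols):
    """Return non-empty cells within half-open row and column ranges."""
    return {
        (i, j): value
        for (i, j), value in cells.items()
        if rows[0] <= i < rows[1] and cols[0] <= j < cols[1]
    }
```

Dense storage suits data such as images, where nearly every cell holds a value. The sparse form suits data such as single-cell count matrices, where most entries are zero.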

Relative Positioning:

Compared to Terra.bio:

+ Better handling of multi-modal data types

+ More flexible data model

- Less established workflow management

- Smaller community ecosystem

Compared to Snowflake:

+ Better suited for scientific data types

+ Native support for multi-dimensional arrays

- Less mature enterprise features

- Smaller partner ecosystem

Compared to Databricks:

+ More specialized for life sciences

+ Better handling of complex scientific data types

- Less developed ML/AI capabilities

- Less mature general analytics features

Compared to Seven Bridges:

+ More flexible data model

+ Better support for multi-modal data

- Less developed pipeline execution

- Fewer pre-built workflows

Compared to DNAnexus:

+ More flexible architecture

+ Better support for diverse data types

- Less established in regulated environments

- Fewer pre-built analysis solutions

Gaps/Opportunities for TileDB:

1. User Interface Development

- Could benefit from stronger natural language query capabilities

- Need for more user-friendly interfaces for non-technical users

- Opportunity to develop better visualization tools

2. Analysis Capabilities

- Could expand pre-built analysis workflows

- Opportunity to develop more automated analysis pipelines

- Need for better support of standard bioinformatics tools

3. Enterprise Features

- Could strengthen governance and compliance features

- Opportunity to develop better collaboration tools

- Need for more robust audit and tracking capabilities

4. Integration Ecosystem

- Could expand partnerships with analysis tool providers

- Opportunity to develop more connectors to common platforms

- Need for better integration with existing workflows

5. Adoption Support

- Could develop more training materials

- Opportunity to create better onboarding processes

- Need for more documentation and examples

Strategic Recommendations:

1. Focus Areas

- Leverage strength in handling complex data types

- Build out more user-friendly interfaces

- Develop stronger integration capabilities

- Expand pre-built analysis solutions

2. Partnership Strategy

- Partner with workflow management platforms

- Integrate with popular analysis tools

- Build relationships with cloud providers

3. Market Positioning

- Position as specialized life sciences platform

- Emphasize multi-modal data capabilities

- Focus on technical differentiators in data model

4. Development Priorities

- Improve user interface and visualization

- Strengthen enterprise features

- Expand analysis capabilities

- Develop better collaboration tools

Based on the interviews, TileDB appears to have strong technical foundations but could benefit from focusing on user experience and enterprise features to better compete with established platforms. The unique data model provides a good foundation for building more specialized life sciences capabilities.

Link to report: https://4zi29whxck3.typeform.com/report/SNUbUU7K/FLrFAlZP3cGSeI0j

The report includes responses to all questions except question 6, which is shown in the screenshot below. (It is unclear why question 6 appears in the preview but not in the final report.)
