The first complete ETL service for web mining
We have been doing research in the field of web scraping and web mining for over ten years, using the most advanced technologies and algorithms.
Spark and big data are our daily bread. Web scraping will no longer be an improvised, one-off process detached from your other data analytics activities.
We orchestrate instances of headless browsers by integrating them directly into our data acquisition pipeline.
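As a simple illustration of the idea, a pool of browser workers can be fanned out over a batch of URLs. This is a minimal Python sketch, not our production code: the headless-browser render is stubbed out so the example stays self-contained.

```python
# Minimal sketch of orchestrating a pool of headless-browser workers
# inside an acquisition pipeline. render_page is a stand-in for a real
# headless-browser render (e.g. PhantomJS) so the example is runnable.
from concurrent.futures import ThreadPoolExecutor

def render_page(url: str) -> str:
    """Stub for a headless-browser render of a dynamic page."""
    return f"<html><body>rendered {url}</body></html>"

def acquire(urls, workers=4):
    """Fan URLs out across a pool of browser workers, collect results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(render_page, urls)))

pages = acquire([f"https://example.com/item/{i}" for i in range(8)])
```

In the real pipeline each worker drives an actual browser instance and feeds its output downstream into Spark.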
We guarantee higher dynamic-page acquisition rates than our main competitors, thanks to our integrated Apache Spark engine.
We will integrate the most efficient, precise, and robust wrapper induction algorithms into our platform, including those the academic literature offers.
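To give a flavor of wrapper induction, here is a hedged sketch of the simplest idea from the literature: learning left/right string delimiters around a target value from labeled example pages (an "LR wrapper"), then applying the learned wrapper to an unseen page. The real algorithms are far more robust; this only illustrates the principle.

```python
# Toy LR wrapper induction: learn the delimiters around a labeled
# value from example pages, then extract from a new page.
import os

def induce_lr(examples):
    """examples: list of (page_html, target_value) -> (left, right)."""
    lefts, rights = [], []
    for page, value in examples:
        i = page.index(value)
        lefts.append(page[:i])
        rights.append(page[i + len(value):])
    # Longest common suffix of the left contexts...
    left = lefts[0]
    for s in lefts[1:]:
        k = 0
        while k < min(len(left), len(s)) and left[-1 - k] == s[-1 - k]:
            k += 1
        left = left[len(left) - k:]
    # ...and longest common prefix of the right contexts.
    right = os.path.commonprefix(rights)
    return left, right

def apply_wrapper(page, left, right):
    """Extract the value delimited by the learned (left, right) pair."""
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]
```

A wrapper induced from two labeled price pages generalizes to a third page with the same template, which is exactly the robustness-to-repetition property these algorithms exploit.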
We guarantee our most expert users full freedom in defining their scrapers, allowing both client-side and server-side code to be written in Python or Node.js.
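A user-defined Python scraper might look like the following. The `Scraper` class and its decorator API are hypothetical illustrations of the concept, not the real SDK.

```python
# Hypothetical sketch of registering a user-defined extraction
# function with the platform; the Scraper API shown here is an
# illustration, not the actual SDK.
class Scraper:
    def __init__(self, name):
        self.name = name
        self._extract = None

    def extract(self, fn):
        """Register the user's server-side extraction function."""
        self._extract = fn
        return fn

    def run(self, html):
        """Apply the registered function to a fetched page."""
        return self._extract(html)

prices = Scraper("price-monitor")

@prices.extract
def parse(html):
    # Toy extraction: text between <span class="price"> tags.
    start = html.index('<span class="price">') + len('<span class="price">')
    return html[start:html.index("</span>", start)]
```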
We will allow you to use your own external proxy rotation service, but a built-in proxy rotation system will also be included in your subscription, thanks to a collaboration with our partner.
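Proxy rotation in its simplest form is a round-robin over a pool of endpoints. A minimal sketch, assuming a fixed pool of placeholder proxy addresses:

```python
# Minimal round-robin proxy rotation; the proxy endpoints are
# placeholders, and real rotation also handles health checks,
# bans, and per-target policies.
from itertools import cycle

class ProxyRotator:
    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the next proxy endpoint, wrapping around the pool."""
        return next(self._pool)

rotator = ProxyRotator(["proxy1:8080", "proxy2:8080", "proxy3:8080"])
```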
We use a dedicated DSL to define the processing pipeline and the acquisition logic. In the longer term, this tool will allow us to include full data analytics functionality.
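To illustrate the pipeline-DSL concept (this is not our actual language, which is ANTLR-based and far richer), imagine steps chained with `|`, each naming a registered operation that is compiled into a callable:

```python
# Illustrative-only pipeline mini-DSL: "step | step | step" is
# compiled into a single callable over a registry of operations.
STEPS = {
    "strip": str.strip,
    "lower": str.lower,
    "first_word": lambda s: s.split()[0],
}

def compile_pipeline(src):
    """Compile e.g. 'strip | lower | first_word' into a function."""
    ops = [STEPS[name.strip()] for name in src.split("|")]
    def run(value):
        for op in ops:
            value = op(value)
        return value
    return run

pipeline = compile_pipeline("strip | lower | first_word")
```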
We will release open-source tools and browser extensions for assisted query generation, page annotation, and execution and local testing of the generated queries.
From individual users to enterprise companies
Trust us for your business success. Try out our service on a serverless configuration with 1 worker.
Up to 500,000 pages processed per month included on serverless architectures with 5 workers.
Up to 1,000,000 pages processed per month included on serverless architectures with 10 workers.
Dedicated Spark cluster for massive acquisitions. Guaranteed level of parallelism on demand.
In today's market, every company that wants to achieve, maintain, or improve its success needs to obtain, organize, and manage ever more information (data) about the market itself, its customers, and its competitors. Those who do this best will be the most successful in their sector.

However, the web is made up of countless sources of unstructured or semi-structured data, and collecting that data is very often expensive relative to the technical effort required. In most cases, that effort takes the form of dedicated solutions that cannot scale, either infrastructurally or algorithmically, and that are not very robust (the web is constantly changing its structure).

Our service will offer an efficient solution to this problem, guaranteeing structural and algorithmic scalability and offering a web service that can be integrated into the broader web-mining process. We draw on the state of the art in cloud and big-data technologies, as well as the most refined automatic extraction algorithms, to guarantee more robust and maintainable solutions. All of this is supported by visual modeling tools and dedicated machine learning algorithms.

Thanks to WebRobot, our customers will be able to concentrate exclusively on integration with their own applications and/or stacks, using the convenient SDKs we will provide. In the long term, WebRobot aims to become a complete ETL service covering data extraction, web mining, machine learning, and big data analytics.
Web scraping is often used by companies that provide competition-monitoring services (price comparison) for the algorithmic pricing of price lists in businesses such as e-commerce. It can also be used by companies that provide press review and social media analysis services, which specifically offer opinion mining and sentiment analysis not limited exclusively to social networks. Furthermore, the collected data can help feed indices used to create vertical search engines.

An interesting application could be the implementation of sentiment indicators that can be exploited for algorithmic trading in finance; hedge funds will in fact be our main target. In addition, some creative applications of web bots relate to SEO and/or web marketing automation. In general, any company that can gain added value by refining acquired data is a potential customer of the platform.
Our platform will run on the AWS stack, using serverless approaches (AWS Lambda) for the trial, entry-level, and professional packages, and AWS EMR for enterprise plans. We will use AWS API Gateway to manage our main APIs, which will be monetized on the AWS Marketplace, along with supporting analytics tools such as AWS Athena and a NoSQL database (DynamoDB) for persistence. Our primary choice of headless browser technology remains the excellent PhantomJS, although we are open to integrating the more recent headless Chromium. An autoscaling scheme will keep the variable costs of our infrastructure flexible.

Our ETL is developed using parser-generation technology (ANTLR) and will constantly evolve to converge toward our complete ETL vision. The wrapper induction algorithms we will offer, currently exposed through an internal API on Elastic Beanstalk, will be progressively integrated into the Spark-based acquisition framework in Java/Scala. We will develop visual support tools (mainly browser extensions) in Node.js to support less experienced users in defining the ETL.

Research on serverless computation, and the consequent dynamic allocation of executors, opens new horizons for deployment in decentralized contexts on DFINITY or iExec, and we will remain open to the new paradigms that DLT technologies bring. The dashboard consists of a developer portal written in React on a serverless architecture and will be completely free.
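To make the serverless side concrete, here is a hedged sketch of the shape an acquisition worker takes on AWS Lambda behind API Gateway: a plain Python handler that receives a JSON batch of URLs and returns per-URL results. `fetch_page` is stubbed so the sketch stays self-contained; the real worker drives a headless browser and persists results to DynamoDB.

```python
# Sketch of a Lambda-style acquisition handler (API Gateway proxy
# integration shape: event["body"] is a JSON string, and the return
# value carries statusCode and a JSON body). fetch_page is a stub.
import json

def fetch_page(url):
    """Stand-in for the real browser-driven acquisition of one URL."""
    return {"url": url, "status": "ok"}

def handler(event, context):
    """AWS Lambda entry point: event body carries a list of URLs."""
    urls = json.loads(event["body"])["urls"]
    results = [fetch_page(u) for u in urls]
    return {"statusCode": 200, "body": json.dumps({"results": results})}
```

Because the handler is a plain function, the same code can be exercised locally before being deployed behind the gateway.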